Artificial Intelligence in Health ADRD caregiver experiences on Reddit
classification method for texts.[17] In this study, we used BERTopic, a sentence-transformers-based model, to extract document embeddings. Compared to previous methods such as Latent Dirichlet Allocation (LDA) modeling,[18] BERTopic incorporates the semantic context of words[19] and further refines the method by considering the varying distributions of semantic distance between words.[19] Like other topic models, it outputs a topic assignment for each comment, as well as the top words of each topic. The top words help us interpret the topics of the comments, while the topic assignments show how popular each topic is and can also be used in the subsequent sentiment analysis.

Another difference between BERTopic and LDA modeling is that BERTopic determines the number of topics from the text, while LDA relies on a user-defined number of topics.[20,21] Using BERTopic, we generated an intertopic distance map to determine the distance (difference) between the topics. An intertopic distance map represents each topic as a circle on a Cartesian plane, where the distance between circles reflects the semantic distance between topics. If the circles do not overlap, the topics are considered well separated. If they do overlap, the topic model is refitted with a suitably smaller number of topics, and the intertopic distance map is plotted again to check whether the topics are well separated. The "step size" of each refitting can vary depending on prior knowledge of the dataset. For example, if no more than 20 topics are expected in the text and the BERTopic model identifies more than 100 topics, the step size can be 5 to 10 fewer topics per refitting until the topics separate or the number of topics is reduced to 20. After that, the step size can be one fewer topic per refitting.

2.3. Sentiment analysis

To understand the sentiment that a comment carries, we performed sentiment analysis, which quantifies positive and negative sentiment. We adopted the most widely used sentiment analysis tool, the Valence Aware Dictionary and sEntiment Reasoner (VADER), for this study.[20] VADER is a rule-based model that summarizes lexical, grammatical, and syntactical features of text and quantifies the tone of sentiment into scores.[20] Compound VADER scores are normalized from the raw VADER scores and span from −1 to 1, with a negative score representing negative sentiment and a positive score representing positive sentiment. We followed the rule of thumb in VADER sentiment analysis and labeled comments with compound VADER scores < −0.05 as negative, those with scores from −0.05 to 0.05 as neutral, and those with scores > 0.05 as positive.

3. Results

3.1. BERTopic modeling output

A total of 1151 comments were collected from 15 Reddit posts from our search results. Using BERTopic topic modeling and manual topic refinement, we categorized the comments into six topics and provide example comments for each topic in Table 1. Topic 0 was identified as “sharing of caregiver stories,” topic 1 as “appreciation of online community,” topic 2 as “concerns of abuse of ADRD family member,” topic 3 as “financial struggles of caregivers,” topic 4 as “early symptoms of ADRD of family member,” and topic 5 as “symptoms of ADRD.” As seen in Table 1, the topic with the greatest proportion of discussions was topic 0 (n = 926), followed by topic 1 (n = 126), topic 2 (n = 33), topic 3 (n = 31), topic 4 (n = 22), and topic 5 (n = 13).

3.2. VADER (sentiment analysis) results

We used VADER to analyze the sentiment of the comments under each topic. Figure 1 shows the average VADER sentiment score of the retrieved posts’ texts for each topic. In Figure 1, the x-axis corresponds to the VADER compound score, which ranges from −1 to 1, where x < −0.05 represents negative sentiment, −0.05 < x < 0.05 represents neutral sentiment, and x > 0.05 represents positive sentiment. As described by the histogram bars in Figure 1, topic 3 is skewed to the right, indicating more positive sentiment, while topics 1 and 3 are skewed to the left, indicating more negative sentiment. Figure 2 provides a direct comparison of comment sentiment proportions. Topic 0 had relatively equal proportions of positive and negative sentiment, whereas topic 5 had the largest proportion of neutral sentiment and topic 3 had the highest proportion of positive posts.

The top words in each topic are displayed in Table 2. Topic 0 was the largest topic of posts and was manually labeled as “shared stories by caregivers.” This topic included stories that ADRD caregivers shared with other ADRD caregiving users on Reddit. Comments included personalized experiences of a family member having ADRD symptoms, describing specific cases in detail. Top keywords included specific family members, such as “mom” and “dad.” As shown in Table 1 and Figure 1, 44.8% of the posts were negative and 53.2% of posts had a positive sentiment.

Topic 1 was manually labeled as “appreciation of online community.” This topic included comments in which caregivers shared gratitude and thanks with other Reddit users, showcasing the benefit of these online communities. The top five keywords in Topic 1 were “thank,” “sorry,” “much,” “go,” and “share.” As shown in Table 1 and Figure 1,
Volume 1 Issue 3 (2024) 129 doi: 10.36922/aih.3075
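As an illustration only, the two decision rules described in the methods text (the schedule of topic counts to try when refitting, and the VADER compound-score cutoffs) can be sketched in plain Python. This is a minimal sketch, not the authors' code: the function names, the default coarse step of 5, and the lower bound of 2 topics are assumptions introduced here. In practice the compound scores would come from a VADER implementation, and the decision to stop refitting would be made by inspecting the intertopic distance map.

```python
from collections import Counter

# Cutoffs taken from the text: < -0.05 negative, > 0.05 positive, else neutral.
NEG_CUTOFF = -0.05
POS_CUTOFF = 0.05


def label_sentiment(compound):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound < NEG_CUTOFF:
        return "negative"
    if compound > POS_CUTOFF:
        return "positive"
    return "neutral"


def sentiment_proportions(compounds):
    """Return the share of negative, neutral, and positive scores in a list."""
    labels = ("negative", "neutral", "positive")
    total = len(compounds)
    if total == 0:
        return {label: 0.0 for label in labels}
    counts = Counter(label_sentiment(c) for c in compounds)
    return {label: counts[label] / total for label in labels}


def candidate_topic_counts(initial, expected_max=20, coarse_step=5, minimum=2):
    """List the topic counts to try, in order, when refitting.

    Coarse steps are used while the count is above `expected_max`;
    after that the count drops by one per refit. `coarse_step` and
    `minimum` are hypothetical defaults, not values from the paper.
    The caller stops at the first count whose topics look well separated.
    """
    counts = []
    n = initial
    while n > expected_max:
        n = max(n - coarse_step, expected_max)  # never jump below expected_max
        counts.append(n)
    while n > minimum:
        n -= 1
        counts.append(n)
    return counts
```

For example, `candidate_topic_counts(32)` starts with 27, 22, 20, 19 and continues one topic at a time, while `sentiment_proportions` applied to the compound scores of one topic's comments gives the kind of per-topic proportions compared in the results.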

