Title:
Social media as a window onto public opinion: Language models are a game changer
Abstract: 
To quickly gain a qualitative sense of the opinions held by a large group of people, social media posts may be a promising resource. But several obstacles must first be overcome: (1) social media’s massive volume is one of its greatest virtues as a data source but also makes it unreasonable to read each post in a corpus; (2) the posting format is unstructured, often making it hard for a reader to extract the gist of each post, especially at scale; (3) the characteristics of users are generally unknown, making it hard to speak about who is posting what content. This talk will present recent work carried out to address these challenges. Specifically, because no one can, or has time to, read more than a small number of posts, we ask a Large Language Model (LLM) to generate a summary of posts; we evaluate the quality of such summaries using crowd-sourced ratings. We address the challenge of extracting the gist of each post, particularly the key opinion on a topic (such as “favor” or “oppose”), by prompting an LLM to classify the stance of each post; we show that taking stance into account can help reveal alignment between patterns of posting over time (i.e., increases and decreases in the volume of posts) and survey responses collected over the same interval. Finally, we describe our in-progress program to address the lack of information about who is posting content by inferring users’ characteristics such as age, education, and partisanship; to do so, we will train a model (an LLM or a pretrained transformer model such as RoBERTa). Our overall goal is to help researchers and other information seekers derive accurate, qualitative impressions about public opinion expressed in social media.