Advance Online Publication
Pages: 123–180
橋接統計與詮釋:大語言模型輔助內容分析用於語料庫輔助論述分析的概念驗證
Bridging Statistics and Interpretation: An LLM-assisted Content Analysis Approach to Corpus-assisted Discourse Analysis
Research Article
Author (Chinese)
柯籙晏
Author (English)
Lu-Yen Ko
Keywords
Human-AI collaboration, Large Language Model-assisted content analysis, Mixed methods research, Interpretive information tools, Proof of concept, Corpus-assisted discourse studies
Chinese Abstract
Corpus-assisted Discourse Studies (CADS) readily falls into the methodological pitfall of “statistical storytelling”: researchers offer interpretations as soon as they see a statistical pattern, without verifying whether the samples used to support those interpretations are actually statistically representative.
As a proof of concept, this study proposes Large Language Model-assisted Content Analysis (LACA) as an auxiliary approach for avoiding this pitfall, and assesses whether LACA can integrate corpus analysis and discourse analysis more effectively by batch-checking the actual semantics of samples that exhibit statistical patterns.
The study establishes a human-AI collaborative LACA procedure and experimentally tests its feasibility as a bridging mechanism for CADS. The experiments show that the procedure not only achieves high coding reliability (Cohen’s kappa ≥ 0.80) but also saves substantial time compared with traditional content analysis, making verification that was “theoretically necessary but practically infeasible” under cost constraints operationally viable.
The study further finds that LACA exceeds its role as a batch semantic verification tool: besides confirming the semantic basis of statistical patterns, it also helps researchers discover unanticipated discourse patterns. The paper further argues that LACA works as an interpretive information tool: researchers’ interpretive logic can be embedded into the LLM’s batch-processing workflow through prompts, realizing thick description in batch form, bridging statistics and interpretation, and offering an operational methodological framework for integrating the quantitative and qualitative strands of mixed methods research.
English Abstract
Corpus-assisted Discourse Studies (CADS) face the methodological pitfall of “statistical storytelling”. Researchers often use keywords in context (KWIC) to purposively select samples matching statistical patterns from corpus analysis and then conduct discourse analysis on these samples without systematically verifying their statistical representativeness. While content analysis offers a mature solution to this pitfall, its high cost renders it practically infeasible for any CADS study.
This research proposes a Large Language Model-assisted Content Analysis (LACA) validation mechanism to integrate corpus analysis and content analysis in CADS, rendering previously “theoretically necessary but practically infeasible” semantic verification operationally viable, thereby avoiding the pitfall of “statistical storytelling”.
Research Questions
As a proof-of-concept study, this research examines the proposed LACA validation mechanism under a minimum viable configuration, using YouTube comments on the popular song “Fragile” as the corpus and addressing four questions.
1. Can LACA effectively identify semantic relationships between co-occurring words in the corpus?
2. How do different LLMs (Large Language Models) and prompts affect LACA’s coding performance?
3. What is the consistency between LLM coding and researcher standards?
4. Can the proposed LACA mechanism effectively bridge statistical patterns and discourse interpretation in CADS, avoiding the pitfall of “statistical storytelling”?
Research Methods
First, the study obtains statistically representative KWIC samples through systematic sampling, using the personal-pronoun search terms 我是 (I am), 你是 (you are), 我們/们 (we), and 你們/们 (you [plural]) to establish a reliable foundation for semantic verification.
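As an illustration, the systematic-sampling step can be sketched in a few lines of Python; this is a generic sketch under the assumption that the concordance lines for one search term have already been exported to a list, and the names are illustrative rather than taken from the study.

```python
# A minimal sketch of systematic sampling over exported KWIC lines.
# A random start offset could be added; omitted here for brevity.
import math

def systematic_sample(kwic_lines: list, sample_size: int) -> list:
    """Take every k-th concordance line so the sample spans the whole
    concordance instead of clustering at the top of the file."""
    if sample_size >= len(kwic_lines):
        return list(kwic_lines)
    step = len(kwic_lines) / sample_size  # sampling interval k
    return [kwic_lines[math.floor(i * step)] for i in range(sample_size)]

# e.g., one sample per search term: 我是, 你是, 我們/们, 你們/们
# samples = {term: systematic_sample(kwic[term], 200) for term in kwic}
```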
Second, it systematically develops a Standard Coded Set through human-machine collaborative iterative prompt construction and refinement. This hermeneutic circle of construction → verification (κ) → refinement involves iteratively examining LLM coding results and refining prompts to improve the logic and clarity of the coding standards, until consistency stabilizes at Cohen’s κ ≥ 0.8. This ensures that the coding judgment principles are clear and operable.
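The κ-threshold stopping rule behind this circle can be sketched as a short loop. In the sketch below, `llm_code` and `revise_prompt` are hypothetical placeholders for the model call and the human refinement step, not the paper’s implementation; the κ computation uses scikit-learn.

```python
# Sketch of the construction -> verification (kappa) -> refinement loop.
from sklearn.metrics import cohen_kappa_score

KAPPA_FLOOR = 0.8  # stopping rule: Cohen's kappa must stabilize at >= 0.8

def refine_until_stable(samples, human_codes, prompt, llm_code, revise_prompt):
    while True:
        llm_codes = [llm_code(prompt, s) for s in samples]
        kappa = cohen_kappa_score(human_codes, llm_codes)
        if kappa >= KAPPA_FLOOR:
            return prompt, kappa  # coding standard judged clear and operable
        # Inspect disagreements and tighten the coding rules in the prompt.
        disagreements = [(s, h, m) for s, h, m
                         in zip(samples, human_codes, llm_codes) if h != m]
        prompt = revise_prompt(prompt, disagreements)
```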
Third, the study conducts experiments using the established standard coding as an evaluation benchmark, comparing the coding effectiveness and consistency of different LLM configurations (Haiku 3.5 vs. Sonnet 4) and prompt types (simple vs. refined). These experiments verify LACA’s feasibility as a bridging mechanism for CADS.
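Under stated assumptions (placeholder model identifiers and a generic `llm_code` wrapper, none of which are from the paper), the 2 × 2 comparison can be organized as a small evaluation grid scored against the Standard Coded Set:

```python
# Sketch of the model x prompt evaluation grid; model names are placeholders.
from itertools import product
from sklearn.metrics import cohen_kappa_score

def evaluate_grid(samples, standard_codes, models, prompts, llm_code):
    """Return Cohen's kappa against the benchmark for every configuration."""
    results = {}
    for model, (prompt_name, prompt) in product(models, prompts.items()):
        codes = [llm_code(model, prompt, s) for s in samples]
        results[(model, prompt_name)] = cohen_kappa_score(standard_codes, codes)
    return results

# e.g. evaluate_grid(samples, standard_codes,
#                    models=["haiku-3.5", "sonnet-4"],
#                    prompts={"simple": SIMPLE_PROMPT, "refined": REFINED_PROMPT},
#                    llm_code=call_llm)  # call_llm is a hypothetical wrapper
```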
The coding task codes as A samples in which a personal pronoun refers to a specific identity (e.g., “臺灣人, 你們讓人喜歡” / “Taiwanese people, you are likable”) and codes all other samples as B. The task’s key challenge lies in distinguishing mere lexical collocation from actual semantic reference. The coding standards must code as B not only samples lacking an identity-word collocation (e.g., “你們讓人喜歡” / “you are likable”) but also false positives in which a pronoun collocates with an identity word without referring to it (e.g., “你們喜歡臺灣人” / “you like Taiwanese people”). Coding reliability tests whether LACA can handle such judgments, ensuring the semantic validity of Category A samples.
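For illustration, a coding prompt embedding this A/B rule might look like the sketch below. The wording is hypothetical, not the study’s actual refined prompt, though the three examples come from the text above.

```python
# Illustrative coding prompt; fill {kwic_line} per sample, e.g.
# CODING_PROMPT.format(kwic_line="臺灣人, 你們讓人喜歡")
CODING_PROMPT = """You are coding KWIC concordance lines.
Code A only when the personal pronoun semantically refers to a specific
identity word; mere co-occurrence is not enough. Otherwise code B.

Examples:
- "臺灣人, 你們讓人喜歡" -> A (你們 refers to 臺灣人)
- "你們讓人喜歡" -> B (no identity word collocates with the pronoun)
- "你們喜歡臺灣人" -> B (identity word co-occurs, but 你們 does not refer to it)

Answer with a single letter, A or B.

Line: {kwic_line}
"""
```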
Research Findings
The findings reveal that model and prompt configurations significantly affect LACA’s coding performance. When both models use refined prompts to code KWIC samples from the four search terms (我是, 你是, 我們/们, 你們/们), Sonnet 4 significantly outperforms Haiku 3.5, achieving almost perfect agreement across all tasks (κ = 0.869–0.979). In contrast, Haiku 3.5’s performance declines with corpus complexity: for the most ambiguous “你們/们” samples, reliability drops to κ = 0.380, below accepted content analysis standards.
For prompt comparison, when Sonnet 4 codes identical KWIC samples, refined prompts significantly outperform simple prompts. For simpler “我是” samples, both prompts achieve excellent consistency, but refined prompts further improve reliability (from κ = 0.924 to κ = 0.979). Prompt effects are more pronounced on complex corpora: for the most challenging “你們/们” samples, refined prompts elevate consistency from acceptable levels (κ = 0.705) to excellent levels (κ = 0.869). Results demonstrate that selecting appropriate model and prompt configurations is critical to ensuring LACA’s effectiveness.
Beyond its expected function as a batch semantic verification tool, LACA also serves as a systematic filtering tool, assisting researchers in discovering meaningful discourse patterns from semantically validated and statistically representative samples. For instance, this research identifies a novel self-identity metaphor among LACA-verified samples: “I am a coconut.”
The research further demonstrates how LACA-verified samples enable identifying discourse patterns in the corpus, specifically “pervasive irony and distrust toward commenters’ self-declarations”, effectively avoiding the pitfall of “statistical storytelling”.
Discussion
Building on these findings, the study examines LACA’s methodological significance. Results show that its effectiveness depends on two factors: LLM performance thresholds and researchers’ ability to transform domain expertise into executable prompts. However, prompt engineering faces a black-box challenge: specific design principles become obsolete as models evolve, and logically more refined prompts may even reduce coding reliability.
To address this challenge, the study proposes the Clinical-Driven principle of prompt engineering, advocating systematic iterative prompt refinement with empirical effectiveness as the optimization standard. This meta-principle ensures LACA’s continued applicability as LLMs and prompt strategies evolve. Reproducibility depends on transparently documenting decision logic and verification processes, not on replicating specific prompt principles.
LACA embodies the methodological significance of an interpretive information tool. From theory-driven search term selection and methodologically-informed sampling design to clinically-driven prompt engineering, researchers’ theoretical judgments and interpretations are embedded into the CADS process through LACA’s mediation at multiple stages, essentially realizing the batch implementation of thick description.
LACA provides a concrete operational framework for integrating quantitative and qualitative approaches, and it performs strongly across five dimensions. In inference quality, it grounds discourse analysis results in statistically representative samples. In integration effectiveness, it establishes operational integration procedures, reducing CADS’s frequent failure to integrate quantitative and qualitative results. In expanding understanding, it enables researchers to systematically identify unanticipated discourse patterns in corpora. In coding reliability, high human-machine agreement (under the optimal configuration, all κ values > 0.85) ensures the validity of subsequent discourse analysis. In feasibility and practical value, it achieves substantial reductions in both cost and time compared to traditional content analysis.
Research Limitations and Future Directions
This research adopts a proof-of-concept minimum viable configuration, and future applications can expand it as needed. The single-researcher design can be extended to multiple researchers. Because this study deliberately selected the straightforward “personal pronoun + identity word” pattern, LACA’s potential for more complex pragmatic phenomena awaits exploration. Future research can also test LACA across different theoretical frameworks and corpus types. Leveraging LLMs’ multimodal capabilities, the Clinical-Driven principle can serve as the meta-guide for extending LACA toward multimodal applications.