Computer-Aided Text Analysis

The basic medium of interpersonal and mass communication is text. Analyzing text helps in understanding the meanings of mass media messages and their potential effects, observing strategies and developments of rhetoric, and identifying rules and structures of social communication. Thus text analysis, comprising all kinds of qualitative and quantitative techniques of media content, discourse, and linguistic analysis, is an indispensable methodological field for all social sciences. But because communicators’ encoding of meaning and ideas (semantics) in text (syntax) does not always follow a straightforward logic and is heavily influenced by social and cultural context, the analysis of large text corpora is a complicated and laborious endeavor. Therefore, from the beginning of computer science, scholars have tried to develop automated tools for understanding, abstracting, and classifying texts.

Compared to the promise of artificial intelligence research in the 1960s and 1970s, today’s facilities for computer-aided text analysis are disappointing. Nonetheless, there is virtually no text analysis without computer support, provided that one basic premise is met: the text corpus under analysis has to be available in a digital format – as a full text, as a collection of relevant portions of the text (title, author, medium, lead text, date of publication, etc.), or in an indexed fashion (containing additional meta-data such as an abstract or keywords). We can differentiate between computer-based and computer-supported methods of text analysis. While computer-based methods are run completely by computer programs and thus can be automated, computer-supported methods enable or facilitate only selected parts of data collection or analysis and require additional manual operations.

The fields of application of computer-aided text analysis are manifold and cannot be fully outlined in this article. Some examples can give a first impression.

In mass communication studies, content analyses of print media are usually accomplished by human coders. Nonetheless, retrieving relevant articles (i.e., articles dealing with the issue under analysis) from online databases (full text, by keyword or title search, or using other meta-data) drastically reduces the amount of time spent browsing volumes of newspapers or magazines, which in earlier days had to be done manually. There are two drawbacks: first, the retrieval of non-text elements such as photographs, sound files, or movies is limited; second, the identification of relevant articles is restricted to articles containing pre-defined terms, whereas relevant articles using other words (representing the identical meaning) are not found.

A completely computer-based variant of content analysis simply counts word frequencies in mass media content as a macro-indicator of the media agenda, i.e., the increasing or decreasing coverage of specific issues or actors over time (see the agenda-setting study by Funkhouser 1973).
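
The counting approach described above can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure used in the cited study: the function name `term_counts_by_year`, the `(year, text)` input format, and the two toy articles are all assumptions made up for the example.

```python
import re
from collections import Counter


def term_counts_by_year(articles, terms):
    """Count how often each tracked term occurs per year.

    articles: iterable of (year, text) pairs (hypothetical input format).
    terms: iterable of lowercase single-word terms to track.
    """
    tracked = set(terms)
    counts = {}
    for year, text in articles:
        # Crude tokenization: lowercase runs of ASCII letters only.
        tokens = re.findall(r"[a-z]+", text.lower())
        year_counter = counts.setdefault(year, Counter())
        year_counter.update(t for t in tokens if t in tracked)
    return counts


# Invented toy data, not real media content.
articles = [
    (1964, "Vietnam dominated the news. Crime was rarely mentioned."),
    (1968, "Vietnam, Vietnam, and again Vietnam; crime and inflation followed."),
]
agenda = term_counts_by_year(articles, ["vietnam", "crime", "inflation"])
for year in sorted(agenda):
    print(year, dict(agenda[year]))
```

Plotting such counts per issue over time would give the kind of macro-indicator mentioned above. Note that the sketch matches exact word forms only, so it shares the limitation of keyword retrieval: coverage that uses other words for the same issue goes uncounted.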

Although they offer only limited methodological validity, standard search engines like Google, specialized search services like Google News, or self-programmed web spiders or crawlers can be applied for data-mining on the Internet, i.e., retrieving relevant websites or web pages which can be automatically or manually analyzed afterwards.

Linguistic text-mining tools classify text documents by measuring the vocabulary used (dictionary) or the style of language (e.g., sentence length, the share of nouns and verbs) and conducting cluster analyses with the material. Such tools can be used for discourse and linguistic analyses of email, chat, or discussion forum communication; they can also help to identify unknown authors of historical texts by comparing them with other texts by known authors.
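
As a rough illustration of style-based comparison, the following Python sketch computes two simple features (mean sentence length and the share of distinct words) and assigns an unknown text to the known author with the nearest feature vector. It is a deliberately simplified stand-in: it uses a nearest-profile comparison rather than a full cluster analysis, and the function names, the feature choice, and the toy texts are assumptions for the example.

```python
import math
import re


def style_features(text):
    """Return (mean sentence length in words, type-token ratio).

    Assumes a non-empty text with at least one word.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    mean_sentence_length = len(words) / len(sentences)
    type_token_ratio = len(set(words)) / len(words)
    return (mean_sentence_length, type_token_ratio)


def closest_author(unknown_text, known_texts):
    """Pick the known author whose features are nearest (Euclidean distance).

    known_texts: dict mapping author label -> sample text (hypothetical format).
    The features are not rescaled, so sentence length dominates the distance.
    """
    target = style_features(unknown_text)
    return min(
        known_texts,
        key=lambda author: math.dist(target, style_features(known_texts[author])),
    )


# Invented toy samples, not real historical texts.
known = {
    "author_a": "I came. I saw. I won. It was fast.",
    "author_b": (
        "When the long campaign had finally drawn to its close, the general "
        "returned to the capital and addressed the assembled senate at "
        "considerable length."
    ),
}
unknown = "We met. We talked. We left. It was late."
print(closest_author(unknown, known))
```

In practice one would use many more features, standardize them, and work with far longer text samples; two features on a handful of sentences only show the principle.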

More sophisticated linguistic technologies can identify the meaning of ambiguous words (disambiguation) by analyzing their context (e.g., the General Inquirer, a classic software package for text analysis developed by Philip Stone in the early sixties at Harvard University; Stone et al. 1966). There are several tools under development that can identify or classify text corpora at a semantic level and that can summarize or abstract large text corpora. For simple English texts, the results are quite acceptable; but for more complex texts or texts in other languages, these tools are still unsatisfactory.
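
The idea of disambiguation by context can be illustrated with a very small Python sketch. It is not the General Inquirer's actual rule set: the sense inventory `SENSES`, its cue words, and the function `disambiguate` are hypothetical and invented for the example. Each sense of an ambiguous word is described by a few cue words, and the sense with the most cue words near the target word is chosen.

```python
import re

# Hypothetical sense inventory, invented for illustration only.
SENSES = {
    "bank": {
        "financial_institution": {"money", "loan", "account", "deposit"},
        "river_side": {"river", "water", "shore", "fishing"},
    }
}


def disambiguate(word, sentence, window=5):
    """Return the sense whose cue words overlap most with the context.

    Looks at up to `window` tokens on each side of the first occurrence
    of `word`. Returns None if the word is absent, has no entry in
    SENSES, or no cue word appears in the context.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    if word not in tokens or word not in SENSES:
        return None
    i = tokens.index(word)
    context = set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    scores = {sense: len(cues & context) for sense, cues in SENSES[word].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("bank", "She opened an account at the bank to deposit money."))
print(disambiguate("bank", "They went fishing on the bank of the river."))
```

A sentence without any cue word yields no decision at all, which hints at why such tools work acceptably on simple texts but struggle with more complex ones.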

Finally, there is a broad variety of computer programs supporting the basic steps of qualitative text analysis (segmenting texts, paraphrasing and classifying propositions, indexing text examples) by elaborate commentary and tagging tools.

Undoubtedly, the Internet is the most important data source for computer-aided text analysis, and it is steadily changing. Some of today’s trends present significant challenges for the analysis of online content.

Many online documents change rapidly, sometimes within seconds (e.g., Internet news sites, discussion forums, emails). This makes online coding almost impossible and requires that data collection be conducted within a restricted period of time. The content of a particular website can hardly be identified without reference to the time of retrieval.
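
One way to keep the time of retrieval attached to the material is to store every retrieved page as a snapshot record. The following Python sketch assumes the page content has already been downloaded as a string; the record layout and the names `make_snapshot` and `has_changed` are hypothetical, and `https://example.org/news` is a placeholder address.

```python
import hashlib
from datetime import datetime, timezone


def make_snapshot(url, content, retrieved_at=None):
    """Bundle page content with its retrieval time and a content hash.

    retrieved_at defaults to the current time in UTC.
    """
    if retrieved_at is None:
        retrieved_at = datetime.now(timezone.utc)
    return {
        "url": url,
        "retrieved_at": retrieved_at.isoformat(),
        "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "content": content,
    }


def has_changed(earlier, later):
    """True if two snapshots of the same page differ in content."""
    return earlier["sha256"] != later["sha256"]


# Invented example content for a placeholder URL.
first = make_snapshot(
    "https://example.org/news",
    "Headline A",
    datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
second = make_snapshot(
    "https://example.org/news",
    "Headline B",
    datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc),
)
print(first["retrieved_at"], has_changed(first, second))
```

Coding is then done on the stored snapshots rather than on the live site, so that different coders work on identical material.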

On the web, there is a development from static and text-based documents (HTML) to dynamic media environments (Flash, Java, mp3, mpeg). Until recently, most websites provided distinct and constant units of analysis, which could be saved to file for later (re-)examination and scrutinized with established techniques of text analysis (computer-aided and manual). Today, more and more websites offer animated or audiovisual content, and the boundaries of units of analysis are becoming more blurred.

The most crucial challenge is interactivity: in interactive environments, there is no fixed text corpus. Instead, the content is automatically arranged and constructed according to user input. This means that the researcher cannot gather and analyze a given and unchangeable portion of text, but instead has to define an intersubjectively transparent strategy for content retrieval. Methodologically put: the text analysis of interactive material inevitably requires active data production instead of data collection, which challenges the validity and reliability of research.

References:

Batinic, B., Reips, U.-D., & Bosnjak, M. (2002). Online social sciences. Göttingen: Hogrefe.
Funkhouser, G. R. (1973). The issues of the sixties: An exploratory study in the dynamics of public opinion. Public Opinion Quarterly, 37, 62–75.
McMillan, S. J. (2000). The microscope and the moving target: The challenge of applying content analysis to the world wide web. Journalism and Mass Communication Quarterly, 77(1), 80–98.
Popping, R. (2000). Computer-assisted text analysis. London: Sage.
Stone, P. J., Dunphy, D. C., Smith, M. S., & Ogilvie, D. M. (1966). The general inquirer. Cambridge, MA: MIT Press.
West, M. D. (Ed.) (2001). Theory, method, and practice in computer content analysis. Westport, CT: Ablex.