Although questionnaire designers include open-ended questions in their surveys, the answers too often remain underexploited. Statistical analysis requires substantial effort to code textual data into a limited number of predefined categories. The traditional manual technique, which consists of identifying the main themes in the textual corpus, often appears too arbitrary and subjective to quantitative researchers. At the same time, responses to open-ended questions are shorter and necessarily more stereotyped than those collected in in-depth interviews, and cannot reach the same degree of complexity. For these reasons, open-ended questions tend to be neglected. Yet combining qualitative and quantitative approaches makes it possible to exploit both the detail of the responses and the summary produced by statistical methods.
Using the example of lay perceptions of health, this paper shows how multivariate exploratory methods, namely correspondence analysis and cluster analysis, can be applied to textual data to produce a standardized classification of individual responses. The data come from the “enquête permanente sur les conditions de vie des ménages” carried out by the French National Institute for Statistics and Economic Studies (INSEE). In 2001, the rotating module focused on health and included an open-ended question: “What does being healthy mean to you?” (“Etre en bonne santé, qu’est-ce que cela signifie pour vous?”). With 5000 respondents and a representative sample, this survey offers a good opportunity to identify the various perceptions of health in the whole population, to quantify their relative importance, and to describe them in terms of individuals' social and medical characteristics.
The aim of this paper is to discuss the advantages and limits of moving from texts to codes by means of statistical methods. Turning individual responses into a typology of health perceptions is neither straightforward nor fully automated by the SPAD statistical package. At each stage, from grouping words together to producing the final typology by cluster analysis, the researcher faces subjective choices, which can be decisive for the conclusions but are also essential for making sense of textual data.
The statistical methods used (correspondence analysis followed by cluster analysis) aim to identify sub-groups of individuals formed solely on the basis of the words they used to define good health. From the 1409 distinct words, grouped under 120 lemmas, the cluster analysis finally yields ten categories: “healthy lifestyle”, “energy”, “absence of serious illness”, “well-being”, “never go to the doctor’s”, “absence of medicine”, “absence of pain”, “being happy and leading a normal life”, “capacity”, and “dependency”. These categories seem strongly related to biological and social age, i.e. to the deterioration of health with age and to the life cycle. In conclusion, these findings support the application of multivariate exploratory methods to textual data as a form of computer-aided coding, which produces less subjective categorizations and reduces the cost of processing open-ended questions.
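As a rough illustration of the first step described above, the following Python sketch groups words under lemmas and builds the table of responses by lemmas that a correspondence analysis would take as input. It is not the SPAD procedure used in the study: the lemma map, the English example responses, and the function names `lexical_table` and `chi2_distance` are all hypothetical and chosen only for illustration. The chi-square distance between row profiles is the metric that correspondence analysis relies on.

```python
import re
from collections import Counter

# Hypothetical lemma map: each surface word is assigned to one lemma.
# The actual groupings used in the study are not reproduced here.
LEMMAS = {
    "ill": "illness", "illness": "illness", "sick": "illness",
    "doctor": "doctor", "doctors": "doctor",
    "pain": "pain", "pains": "pain",
    "work": "work", "working": "work",
}

def lexical_table(responses, lemmas=LEMMAS):
    """Return (columns, rows): one row of lemma counts per response."""
    columns = sorted(set(lemmas.values()))
    rows = []
    for text in responses:
        tokens = re.findall(r"[a-z']+", text.lower())
        # Words absent from the lemma map are simply ignored.
        counts = Counter(lemmas[t] for t in tokens if t in lemmas)
        rows.append([counts[c] for c in columns])
    return columns, rows

def chi2_distance(rows, i, j):
    """Chi-square distance between the profiles of rows i and j.

    Assumes both rows contain at least one retained lemma.
    """
    total = sum(map(sum, rows))
    n_cols = len(rows[0])
    col_mass = [sum(r[k] for r in rows) / total for k in range(n_cols)]
    prof_i = [x / sum(rows[i]) for x in rows[i]]
    prof_j = [x / sum(rows[j]) for x in rows[j]]
    squared = sum(
        (a - b) ** 2 / m
        for a, b, m in zip(prof_i, prof_j, col_mass)
        if m > 0
    )
    return squared ** 0.5

# Invented example responses, not taken from the survey.
responses = [
    "Not being ill and never seeing doctors",
    "Never going to the doctor",
    "Being able to work without pain",
]
columns, rows = lexical_table(responses)
```

In this toy example the first two responses share the lemma "doctor", so their chi-square distance is smaller than the distance between the first and third responses. The choice of which words to merge under a lemma is exactly the kind of subjective decision discussed above, and it directly changes these distances.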