I’ve been thinking about the changing influence of large scale datasets on how social scientists understand difference. For the most part, it is pretty easy to analyse survey data in terms of gender; and while class and status are more complex, there are several well-understood (if not always agreed) approaches to categorising people by occupation or income. But when it comes to ethnicity, we’re challenged.
Part of the problem arises because people have strong feelings about ethnicity. A storm of protest met early attempts to collect information about ethnicity in the census. Proposals to include ethnicity in the 1981 census disappeared, and although it has featured since then, there has been repeated controversy over which categories to use.
Data on ethnicity are also collected in the main longitudinal surveys that provide such rich source material for social scientists in Britain. The cohort surveys and panel surveys have informed major studies of social mobility, as well as providing the raw material for recent research into the benefits of adult learning. However, it has so far been very difficult to analyse these surveys in terms of ethnicity.
The researcher faces a dilemma. Either you aggregate the responses of people from different ethnic groups, using an umbrella category such as ‘South Asian’, in which case you will miss very important variations between them. Or you present your findings for each separate group, while making it clear that they are based on such a small number of respondents that the results are not statistically significant.
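To see why small groups are such a problem, it helps to look at how fast a confidence interval widens as the number of respondents shrinks. The figures below are invented purely for illustration: an umbrella category of 400 respondents against a single constituent group of 25, both with 40% giving a particular answer.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Approximate 95% confidence interval for a sample proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return (p - z * se, p + z * se)

# Hypothetical figures: an aggregated category of 400 respondents
# versus one constituent group of just 25.
lo, hi = proportion_ci(160, 400)
print(f"aggregated (n=400): {lo:.2f} to {hi:.2f}")   # roughly ±5 points
lo, hi = proportion_ci(10, 25)
print(f"single group (n=25): {lo:.2f} to {hi:.2f}")  # roughly ±19 points
```

The same 40% estimate is pinned down to within about five percentage points for the aggregate, but swings across nearly forty points for the small group, which is exactly why researchers have felt forced to aggregate.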
This is likely to change in the near future. First, some of the major surveys now involve boosted samples of minority ethnic respondents. The Millennium Cohort Study, for instance, was structured by neighbourhood, allowing areas with high proportions of ethnic minorities to be deliberately over-represented (researchers will, of course, allow for this when analysing responses).
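The way researchers ‘allow for this’ is through design weights: respondents from over-sampled areas are down-weighted so that estimates still reflect the population. A minimal sketch, with entirely invented shares, where a high-minority stratum makes up 10% of the population but 50% of the sample:

```python
# Invented figures for illustration only: the high-minority stratum is
# 10% of the population but 50% of the boosted sample.
sample = [
    {"stratum": "high_minority", "outcome": 1},
    {"stratum": "high_minority", "outcome": 0},
    {"stratum": "other", "outcome": 1},
    {"stratum": "other", "outcome": 1},
]
pop_share = {"high_minority": 0.10, "other": 0.90}
samp_share = {"high_minority": 0.50, "other": 0.50}

# Design weight = population share / sample share for each stratum.
weights = {s: pop_share[s] / samp_share[s] for s in pop_share}

num = sum(weights[r["stratum"]] * r["outcome"] for r in sample)
den = sum(weights[r["stratum"]] for r in sample)
print(round(num / den, 2))  # weighted estimate: 0.95 (unweighted would be 0.75)
```

The over-sampled stratum still contributes enough respondents for its own separate analysis; the weighting only matters when results are combined into population-level figures.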
Second, researchers increasingly have access to large bodies of administrative data, suitably anonymised. They can then use linkage techniques to analyse information on individuals that was originally collected by the NHS, education authorities and other public bodies. This approach is being pioneered in Scotland, and offers considerable potential for detailed and robust statistical studies of small groups.
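One common flavour of such linkage is deterministic matching on a pseudonymised identifier: each data holder applies the same one-way hash to its raw identifiers, so records can be joined without the raw identifier ever being shared. The sketch below uses invented records and a toy salt; in practice the salt would be held by a trusted third party.

```python
import hashlib

def pseudonymise(identifier: str, salt: str) -> str:
    # One-way salted hash: both data holders apply the same transformation,
    # so matching records get the same key without exposing the identifier.
    return hashlib.sha256((salt + identifier).encode()).hexdigest()

SALT = "shared-secret"  # invented; managed by a trusted third party in reality

# Invented records keyed by a raw identifier (e.g. a patient number).
health = {pseudonymise("9434765919", SALT): {"admissions": 2}}
education = {pseudonymise("9434765919", SALT): {"qualification": "degree"}}

# Join on the pseudonymous keys present in both datasets.
linked = {k: {**health[k], **education[k]}
          for k in health.keys() & education.keys()}
print(len(linked))  # 1 linked record
```

Real linkage projects also use probabilistic matching on names, dates of birth and addresses, but the pseudonymised join above is the basic idea behind analysing administrative data without identifying individuals.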
And thirdly, modern information processing methods allow researchers to ask extremely complex questions of large datasets. I remember carrying copies of completed questionnaires over to something called an electronic data processing centre at Warwick, which then seemed very zippy to me. It took a couple of weeks before I had the results, and longer still if anything needed running again. Now, advanced statistical analyses take a laptop an afternoon.
In other words, it is going to be much easier to use large datasets to study ethnicity. We will not only be able to distinguish between smaller categories of ethnicity for minority groups, but also among those of white European origin. And we’ll be able to ask new questions and draw on new types of data – indeed, in principle, we could even link survey data with individual genetic information.
I’m not convinced that giving social researchers access to people’s genetic codes will happen any time soon. It might, as it is only a small step from exploring how people’s genetic background affects health to considering how it might affect other life chances. My point at this stage is that our capacity for studying ethnicity has expanded dramatically, and is growing. This should be a force for enriching social science, and improving its public impact, but it won’t be an easy process.