a resource for multidisciplinary studies on human genetic and linguistic variation
The GeLaTo dataset is a worldwide diversity panel of available population genetic samples matched with databases of linguistic diversity. Each genetic population is associated with the main language spoken by its members. The genetic data were chosen according to the following guidelines: maximum compatibility and standardization, modern high-quality data, avoidance of ascertainment bias, availability for different regions of the world, and resolution high enough to capture recent events. The dataset provides derived summary statistics such as genetic diversity within a population, genetic proximity between pairs of populations, sharing of identical motifs, and demographic history reconstructions.
The resource is designed to explore connections between linguistic diversity and the history and diversity of human groups. The scientific information in GeLaTo should be used with respect for people's cultures and traditions.
Each genetic population considered is matched with a single GlottoCode identifier, which corresponds to the main language spoken by the population. This information is recovered from the original genetic publication and is inferred from direct sampling observation, cultural/linguistic self-identification, or geographical characterization, with the assistance of linguists and anthropologists (for a list of people who contributed expertise, see Credits). Languages introduced during the colonial era (widely diffused transnational languages) are not considered, in order to exclude the wave of historical language shift documented over the past ~2 centuries.
The GlottoCode link returns the linguistic classification of the genetic population samples.
The geographic location of each population is based on information about the genetic samples, not on linguistic information. Migrants are located at their place of origin before migration when this information is available; details for migrant populations are given in the curation notes.
Multilingualism is a common feature of human populations. In cases of multilingualism, we consider only one language as the main ("non-colonial") language present in the population. In some cases, suggestions for alternative language assignments are given in the curation notes.
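The curation rules described above (one main language per population, colonial languages excluded, migrants placed at their place of origin) can be sketched as a small record type. This is only an illustration: the class names, field names, and the GlottoCode values below are hypothetical placeholders and do not reflect the actual GeLaTo file layout or any real language. The sketch also assumes that candidate languages are listed in the curator's order of preference.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)


@dataclass
class LanguageCandidate:
    """One language reported for a population (hypothetical structure)."""
    glottocode: str          # placeholder identifier, not a real GlottoCode
    colonial: bool = False   # True for languages introduced in the colonial era


@dataclass
class PopulationEntry:
    """Hypothetical record for one genetic population."""
    name: str
    sampling_location: Coordinates
    origin_location: Optional[Coordinates] = None  # known only for some migrants
    candidates: List[LanguageCandidate] = field(default_factory=list)
    curation_notes: str = ""

    def main_language(self) -> Optional[str]:
        """Return the first non-colonial candidate, or None if there is none."""
        for candidate in self.candidates:
            if not candidate.colonial:
                return candidate.glottocode
        return None

    def alternative_languages(self) -> List[str]:
        """Remaining non-colonial candidates, e.g. for the curation notes."""
        non_colonial = [c.glottocode for c in self.candidates if not c.colonial]
        return non_colonial[1:]

    def map_location(self) -> Coordinates:
        """Use the place of origin when it is known, else the sampling place."""
        if self.origin_location is not None:
            return self.origin_location
        return self.sampling_location


# Toy example with made-up values.
entry = PopulationEntry(
    name="ExamplePopulation",
    sampling_location=(50.0, 8.0),
    origin_location=(10.0, 20.0),
    candidates=[
        LanguageCandidate("cccc0000", colonial=True),
        LanguageCandidate("xxxx0001"),
        LanguageCandidate("yyyy0002"),
    ],
)
print(entry.main_language(), entry.alternative_languages(), entry.map_location())
```

Keeping the alternatives alongside the main assignment mirrors the role of the curation notes: the single main language drives the match, while other plausible assignments remain visible.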
Methods from population genetics are used to calculate values of diversity within and between populations. The genetic summary statistics provided correspond to measures of relatedness and are shaped by the interactions among the ancestors of the individuals who contributed their genetic profiles. The summary statistics associated with each population are suitable for investigating population history, and are not intended for medical or commercial purposes.
Further information on the population genetics methods employed is available in parameters.
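To give a rough intuition for what "diversity within and between populations" can mean, the following toy sketch computes the mean number of pairwise differences among sequences of one population and between sequences of two populations. It is not the GeLaTo pipeline, and the statistics actually distributed with the dataset may be defined differently. The function names and the binary strings standing in for genetic profiles are assumptions made for illustration only.

```python
from itertools import combinations, product
from typing import Sequence


def differences(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def within_diversity(population: Sequence[str]) -> float:
    """Mean pairwise differences among all pairs within one population."""
    pairs = list(combinations(population, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(differences(a, b) for a, b in pairs) / len(pairs)


def between_distance(pop_a: Sequence[str], pop_b: Sequence[str]) -> float:
    """Mean pairwise differences across all pairs drawn from two populations."""
    pairs = list(product(pop_a, pop_b))
    if not pairs:
        raise ValueError("both populations must be non-empty")
    return sum(differences(a, b) for a, b in pairs) / len(pairs)


# Made-up toy data: each string is one individual's profile at four sites.
pop_a = ["0000", "0001", "0011"]
pop_b = ["1111", "1110", "1100"]

print("within A:", within_diversity(pop_a))
print("within B:", within_diversity(pop_b))
print("between A and B:", between_distance(pop_a, pop_b))
```

In this toy data the between-population value is larger than either within-population value, which is the kind of pattern that summary statistics of genetic proximity are meant to capture.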
If you use this data, please cite
Barbieri et al. 2022. A global analysis of matches and mismatches between human genetic and linguistic histories. PNAS. DOI: 10.1073/pnas.2122084119
as well as the released version of the dataset.
Icons made by Freepik from www.flaticon.com are licensed under CC 3.0 BY