So, do the different linguistic family trees of Indo-European tell you different things? I think that grammar-based trees are better at telling you about the original splits between language sub-groups, whereas word-based trees can tell you about how isolated those sub-groups became.

The family trees of Indo-European as envisaged by Don Ringe and colleagues on the one hand, and Russell Gray & Quentin Atkinson on the other.

Why are these two family trees for Indo-European so different in shape?

At the beginning of the millennium two papers were produced which both purported to show, using statistical ‘cladistic’ analysis, the structure of the Indo-European language family. The resulting family trees of the two studies were completely different.

Gray & Atkinson’s 2003 study was based largely on modern IE languages, and used the existence of related words in the lexicon of each of the languages to construct a family tree. Ringe, Warnow and Taylor’s 2002 study instead used the earliest languages that they could and, as well as the lexicon, also used grammatical (morphology) and sound change (phonology) features to help classify the results.

And the tree shapes would have, frankly, looked much the same if the linguist of the latter group, Don Ringe, hadn’t known some of the answers that he wanted beforehand. His frustration was that there were loads of shared words between languages but not that many shared morphological or phonological features. Yet his hunch was that the latter features were more important than the words.

So what Ringe’s group did was to weight morphology and phonology hugely compared to the words. This gave them answers that made sense to Don Ringe but left the group subject to some criticism for effectively biasing the data. What he was saying was, in effect, that the words were of lesser importance than the grammatical rules.

Now, what I find frustrating is not that Ringe’s group weighted the data but that they didn’t just go the whole hog and dump the lexicon altogether, just concentrating on morphology and phonology. Maybe they didn’t think that there was enough data. Either way it would have been nice to see the results of such a study.

Building trees from words, sounds and grammar

Don Ringe’s bias was to say that words are important, but not that important in constructing the history of IE languages. He attributes this in part to word borrowing between languages. Whilst Ringe’s group did their best to spot word borrowing and filter out the borrowed words, they admit that they may have failed to detect borrowing between languages.

In this they appear to disagree with Andrew Garrett, who thinks exactly the opposite. Andrew Garrett’s position is that the current phonology and grammatical rules of IE language sub-groups are late features, formed as a result of geographically adjacent IE dialects converging in their grammars and sounds. This, he argues, is the result of crises. Therefore Garrett prefers to see the words as being key to older connections, as these are less likely to be changed in his opinion.

The aim of this post (which has, once again, turned into an essay) is to try to see if it’s possible to look at family trees for either grammar, sounds or words and see how they differ and how they might be affected either through original shared connections or through geographical convergence.

Let’s study Indo-European morphology!

Now personally, I can’t do cladistic analysis, but what I can do is play with spreadsheets.  What I’ve done here is to take phonological and morphological differences recognised between IE language sub-groups and analyse them in order to find if they reveal anything on their own.

To increase the size of the list I’ve added the morphology list of Gamkrelidze & Ivanov (1995, p. 345) to the phonology and morphology lists of Ringe et al. (2002) (as supplemented by the updated list of Nakhleh 2007). There may be other good, equivalent lists out there, but I don’t know about them. It should be pointed out that Ringe’s and G&I’s lists overlap in terms of their morphological criteria, and there are a couple of cases where a morphological criterion from one list can be mapped exactly onto one from the other.

However, the morphological interpretations of Ringe and G&I are clearly not the same. Surprisingly, in many cases where the morphological criteria appear to be describing the same grammatical feature, the list of which language sub-groups share that criterion is often not even the same. In such cases I’ve allowed each criterion to be left separate as this should prevent any bias whilst also preventing any unnecessary loss of data. However, where their morphological criteria are identical I have merged them.

Furthermore, to save complications I have only used criteria which allow the comparison of whole, well established language sub-groups. I have not used criteria which only occur in one language sub-group, as these are not informative when trying to compare sub-groups for similarity. Additionally, all criteria indicate the presence, rather than the absence of a morphological or phonological feature.

This leaves me with 40 morphological and 4 phonological criteria, a total of 44 criteria. As the 4 phonological criteria are not sufficient to be analysed on their own I’ve incorporated them into the morphological analysis.

A list of Indo-European grammatical features and which languages they occur in.

The grammatical and phonological criteria used in this analysis and which Indo-European languages they occur in (Ce = Celtic, It = Italic, To = Tocharian, An = Anatolian, Ar = Armenian, Gr = Greek, II = Indo-Iranian, Sl = Slavic, Ba = Baltic, Ge = Germanic). The fact that this looks like a question mark is pure chance.

The list to the right shows the criteria discussed. Those marked M and P are morphological and phonological features taken from Ringe et al. (2002) p113-120 and Nakhleh 2007. Those marked G* are morphological features taken from G&I p345.

Where criteria in Ringe and G&I are similar but not identically allocated to language sub-groups, the related criteria are shown in the next column. If a criterion is exclusive (i.e. it clashes with another), then the criteria which cannot occur simultaneously are listed in the ‘opposes’ column. All criteria marked by an asterisk (*) are considered polymorphic (i.e. could develop more than once) by Ringe’s team and I’ve extended these to G&I. Criterion M11(2), although omitted by Ringe’s team from their later analysis, has been left in here, but with additions to show that it may not be unique to Italic and Celtic.

The presence of a particular feature in an IE language sub-group is indicated by a 1. Where 0.5 is given this is where G&I have indicated a potential occurrence in a language sub-group by brackets or, as in the case of M11(2), Ringe’s team suspect that related features may occur in this sub-group. These 0.5s have been ignored in the analysis, although they actually make little to no difference to the analysis.
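A minimal sketch of the scoring rule just described, in Python. It turns one row of 1, 0.5 and 0 entries into the set of sub-groups that count as showing the criterion, with a switch for whether the 0.5 "possible" entries are included. The function name and the sample row are illustrative assumptions and are not taken from the real table.

```python
def present_in(row, count_possible=False):
    """Return the sub-groups in which a criterion counts as present.

    row maps a sub-group label to 1 (present), 0.5 (possible) or 0 (absent).
    By default the 0.5 entries are ignored, as in the analysis above.
    """
    threshold = 0.5 if count_possible else 1
    return {group for group, score in row.items() if score >= threshold}


# Hypothetical row for illustration only; not copied from the real table.
example_row = {"Ce": 1, "It": 1, "To": 0.5, "An": 0, "Ge": 0}

strict = present_in(example_row)                        # 1s only
lenient = present_in(example_row, count_possible=True)  # 1s and 0.5s
print(sorted(strict))
print(sorted(lenient))
```

Running the comparison both ways is one way to check the claim that the 0.5s make little difference to the outcome.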

You’ll notice that the language sub-groups and criteria have not been put in alphabetical order. They are, in fact, in the order that appears to give the highest correlations between each successive family and criterion. While it looks clearer, it also gives a hint as to how the morphology and phonology of the language sub-groups might be connected.
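The post does not say how this ordering was found, so the following is only one possible way to produce something similar, not the method actually used. It is a greedy chain in Python: start from one sub-group and repeatedly append whichever remaining sub-group is most similar to the last one placed. Jaccard similarity stands in for "correlation" here, and the feature sets are invented for illustration.

```python
def jaccard(a, b):
    """Shared criteria divided by all criteria seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_order(features, start):
    """Chain sub-groups so that each is followed by its most similar neighbour.

    features maps a sub-group label to its set of criteria.
    Ties are broken alphabetically so the result is repeatable.
    """
    order = [start]
    remaining = set(features) - {start}
    while remaining:
        last = features[order[-1]]
        best = max(sorted(remaining), key=lambda g: jaccard(last, features[g]))
        order.append(best)
        remaining.remove(best)
    return order


# Invented toy data; labels A-D are placeholders, not real sub-groups.
toy_features = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {4, 5},
    "D": {5, 6},
}
chain = greedy_order(toy_features, start="A")
print(chain)
```

A greedy chain depends on the starting point and is not guaranteed to find the best overall ordering, so it is best treated as a quick visual aid.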

Please note that I have not included Albanian, Phrygian, Messapic and Venetic, which are listed either by Ringe’s team or by G&I. This is because they appear to share very few features with the other sub-groups, presumably due to either heavy loss of features (in the case of Albanian) or due to lack of evidence.

The results

The first table indicates simply the number of criterion correspondences between different language sub-groups. It’s difficult to interpret as different language sub-groups have different numbers of potential corresponding features. The number of potential correspondences in a sub-group is indicated where both column and row have the same label: e.g. row Gr, column Gr shows the value 19, so 19 is the maximum number of correspondences that the Greek language sub-group could have with other sub-groups.
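As a rough illustration of how such a count table can be built, the Python sketch below counts the criteria shared by every pair of sub-groups. The diagonal falls out naturally, because a sub-group compared with itself shares all of its own criteria. The criterion names c1 to c6 and their allocation are made up; only the sub-group labels come from the post.

```python
from itertools import product


def correspondence_counts(features):
    """Count shared criteria for every ordered pair of sub-groups.

    features maps a sub-group label to the set of criteria it shows.
    The entry for (X, X) is the number of criteria X has, which is the
    most it could ever share with another sub-group.
    """
    groups = list(features)
    return {
        (row, col): len(features[row] & features[col])
        for row, col in product(groups, repeat=2)
    }


# Toy allocation for illustration; not the real data.
toy = {
    "Gr": {"c1", "c2", "c3", "c4"},
    "Ar": {"c2", "c3", "c5"},
    "To": {"c1", "c6"},
}
counts = correspondence_counts(toy)
print(counts[("Gr", "Gr")])  # Greek's own total
print(counts[("Gr", "Ar")])  # criteria Greek and Armenian share
```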

The second table shows the percentage of criterion correspondence between sub-groups. The diagonal line of 100s is simply the result of comparing a language family with itself (which gives 100%). To compare how many features Tocharian shares, say, with Italic, find the column To and compare with row It.

In this case it shows that Tocharian shares 89% of its features with Italic, but note that Italic shares only 47% of its features with Tocharian. The asymmetry arises because Tocharian has only 9 potentially sharable grammatical features, whereas Italic has 17, meaning that Italic could never share more than 100 × 9/17 ≈ 53% of its features with Tocharian.
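The arithmetic behind the asymmetry can be checked with a few lines of Python. The totals of 9 and 17 are from the text; the shared count of 8 is not stated directly and is inferred here from the 89% and 47% figures, so treat it as an assumption.

```python
def percent_shared(shared, own_total):
    """Percentage of a sub-group's own criteria that are shared."""
    return 100 * shared / own_total


tocharian_total = 9   # from the text
italic_total = 17     # from the text
shared = 8            # assumption: inferred from the quoted percentages

to_with_it = percent_shared(shared, tocharian_total)
it_with_to = percent_shared(shared, italic_total)

# Even if every Tocharian criterion were shared, Italic could not do better than this.
italic_ceiling = percent_shared(tocharian_total, italic_total)

print(round(to_with_it), round(it_with_to), round(italic_ceiling))
```

The same count gives a different percentage depending on which sub-group's total it is divided by, which is why the second table is not symmetric.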

Significantly, the phonological features used have little overall effect on the pattern seen and could have been removed, giving much the same result.

What comes out clearly from the data is the strong morphological relationship between Italic and Celtic as well as, to a lesser extent, of Tocharian with Italic and Celtic.

The morphological relationship of Anatolian with all other sub-groups of IE is fairly weak across the board, perhaps as a result of changes occurring either in Anatolian or in the rest of the IE sub-groups when they were still clustered as ‘late PIE’. However, although distant, Anatolian’s best apparent correlation is with Tocharian. Note however that Tocharian shares just 33% of its features with Anatolian. These are not close cousins.

Amongst the remainder, Baltic shares many features with Slavic (as is well known). Greek, Armenian and Indo-Iranian also appear to be related in some way, Greek and Armenian perhaps being more closely connected out of the three. Interestingly, Germanic shares many features with both Indo-Iranian and Baltic.

All of this evidence tends to suggest a grammatical and phonological divide between Tocharian and ‘Italo-Celtic’ on the one hand; and Armenian, Greek, Indo-Iranian, ‘Balto-Slavic’ and Germanic on the other.

However, for the remaining languages: Greek, Armenian, Indo-Iranian, Germanic, Slavic and Baltic, there appear to be chains of connection between the different sub-groups which don’t lend themselves obviously to making divisions between groups of languages.

Exclusive States

It could be argued that what we’ve got here could be purely based upon chance. Theoretically, all of these morphological and phonological features could have existed in the earliest forms of PIE and been lost randomly in individual sub-groups.

However, this is not possible in a number of cases. Some criteria, such as G13/M5, form pairs; in this case G13b/M5(1), and G13a & M5(2), indicate the type of endings found in the middle verbs of IE languages. These are exclusive states. It’s not possible to have both.
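A small Python sketch of the exclusivity check: given pairs of criteria that cannot occur together, list any sub-group that is scored with both. The allocation below is illustrative only. It is loosely read off the grouping discussed in this post rather than copied from the table, and the entry "Xx" is a deliberately faulty placeholder added to show what a clash looks like.

```python
def clashes(features, opposing_pairs):
    """Return (sub-group, criterion, criterion) for every exclusive pair found together."""
    found = []
    for group, criteria in features.items():
        for first, second in opposing_pairs:
            if first in criteria and second in criteria:
                found.append((group, first, second))
    return found


# Exclusive pair named in the text: the two types of middle verb endings.
opposing = [("M5(1)", "M5(2)")]

# Illustrative allocation, not the real table. "Xx" is a made-up faulty row.
sample = {
    "It": {"M5(1)"},
    "Ce": {"M5(1)"},
    "Gr": {"M5(2)"},
    "Ge": {"M5(2)"},
    "Xx": {"M5(1)", "M5(2)"},
}
problems = clashes(sample, opposing)
print(problems)
```

An empty result for the real table would confirm that the exclusive pairs really do partition the sub-groups cleanly.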

Using these criteria, it’s theoretically possible to cluster the sub-groups into groups:

Group A – Italo-Celtic, with Tocharian and Anatolian – this is based upon the presence of criteria G10a and G13b/M5(1).

Group B – Those sub-groups not in group A – this is based upon the presence of criteria G10b and related criteria M5(2) and G13a. This group (called ‘Rump IE’ by Jay Jasanoff) can perhaps be further subdivided into:

Group C – Greek, Armenian and Indo-Iranian (called the ‘Southern Dialect Group’ by Jasanoff) – this is based upon the presence of potentially polymorphic criteria G2 and G4a/M10(4). As Ringe’s team have pointed out, the possibility that these features arose more than once in the development of IE sub-groups makes this grouping weaker than it should be, but it’s the best I’ve got.

Group D – Balto-Slavic and Germanic (called the ‘Northern Dialect Group’ by Jasanoff) – this is based upon the presence of the polymorphic criteria G2c and the interrelated G4/M10a(10), M10b(13) and M9b(10), again with the same potential weaknesses as group C.

The major problem with Groups C and D is the presence in both groups of the ‘satem’ language sub-groups Baltic, Slavic and Indo-Iranian, defined by phonological criteria P2(2) and P3(2). ‘Satemization’ is not a reversible process, so if the division is correct then P2(2) and P3(2) would need to be polymorphic. One way or the other, polymorphism is a problem for subdividing group B.
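The satem problem can be stated as a simple set check, sketched here in Python. The memberships of Groups C and D and of the satem set are as given in the text; the function name is an illustrative assumption.

```python
def groups_touched(members, groups):
    """Return, for each group a feature reaches, the members it reaches there."""
    return {
        name: sorted(members & group)
        for name, group in groups.items()
        if members & group
    }


groups = {
    "C": {"Gr", "Ar", "II"},   # Greek, Armenian, Indo-Iranian
    "D": {"Ba", "Sl", "Ge"},   # Balto-Slavic and Germanic
}
satem = {"Ba", "Sl", "II"}     # sub-groups showing P2(2) and P3(2)

spread = groups_touched(satem, groups)
print(spread)
```

A feature that arose once and was never lost should reach only one group. Reaching both C and D is what forces the choice between treating satemization as polymorphic and abandoning the C/D division.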

Comparison with Ringe’s Analysis

The Ringe group’s best-fit family tree of Indo-European sub-groups, based on all criteria but weighted for morphology and phonology, showing the latest positions that individual grammatical or phonological criteria could have arisen.

My preferred tree for Indo-European sub-groups based purely on grammar and phonology. Note that I have chosen to see phonology as potentially able to be passed between sub-groups.

What’s shown here is the morphological tree that appears to fit the data best to me, compared to the best tree of Ringe’s team (from Nakhleh et al. 2005), produced using words as well as morphology and phonology, but heavily weighted toward morphology and phonology.

I have added morphological and phonological criteria to both trees in the position of their latest possible appearance on each tree. Criteria in black are exclusive, and are probably the most important. However, those starred are potentially polymorphic, making them less useful. Crossed out, grey criteria indicate the last possible occurrence of these criteria. Phonological criteria are red.

As Ringe’s team have pointed out, the aim of any tree is to have a continuous part of the tree occupied by one feature, except in the case of polymorphism. However, I would like to go further, and add that the aim of any tree like this is to get the latest possible appearance of the criteria as far to the right of the tree as you can.
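The "continuous part of the tree" test can be sketched in Python as follows. A feature is treated as continuous if the sub-groups showing it make up one whole branch of the tree, or everything except one whole branch. The tree is written as nested tuples. Both the function names and the toy tree are assumptions: the tree only loosely follows Groups A to D from earlier in this post and is not either of the trees pictured.

```python
def clades(tree):
    """Return (all leaves under tree, list of every branch as a frozenset of leaves)."""
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def is_continuous(tree, members):
    """True if members form one branch, or the whole tree minus one branch."""
    all_leaves, found = clades(tree)
    members = frozenset(members)
    if not members or members == all_leaves:
        return True
    return members in found or (all_leaves - members) in found


# Toy tree for illustration only.
toy_tree = (
    (("Ce", "It"), "To", "An"),
    (("Gr", "Ar", "II"), ("Ba", "Sl", "Ge")),
)

print(is_continuous(toy_tree, {"Ce", "It"}))        # Italo-Celtic feature
print(is_continuous(toy_tree, {"Ba", "Sl", "II"}))  # satem feature
```

A feature that fails this test on a given tree either needs to have arisen more than once (polymorphism) or counts against the tree.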

I have chosen to downgrade the significance of phonology, accepting the argument of some linguists that satemization, as represented by P2(2) and P3(2), was a polymorphic development, happening in more than one sub-group, perhaps by convergence (e.g. the discussion over Nuristani in the Indo-Iranian family – see Hegedűs (2012)). In fact, even Ringe’s team have said ‘Because P2 and P3 are less secure than the other phonological characters and than most morphological characters, one cannot easily judge the performance of any given method by how it treats these two characters’.

Either way, this is a matter of choice. If I had taken P2(2) and P3(2) as being important I would have clustered Balto-Slavic with Indo-Iranian and made Germanic the next most closely related language, with Greek and Armenian separate.

My reason for not doing so is that it appears to push too many mutually exclusive criteria to the left of the diagram, even meaning that criteria such as M10(4) and M10(10) are required to be common at the same level, something which again only makes sense if polymorphism is used to justify their emergence repeatedly, as Ringe’s group do.

Apart from this point, the other important difference is my grouping of Italo-Celtic with Tocharian, which to me fits the morphology data better.

Incompatibility with words

In terms of incompatibilities (i.e. criteria or words which appear not to fit the trees), both show some incompatibilities with the filtered word list of Don Ringe’s team. For my tree, 15 words are incompatible, one twice (allx2 beard break1 breast1 drink fish1 fog2 free give1 one pour long1 straight tearx2 young2), not forgetting the additional 2 phonological incompatibilities. In the case of Ringe’s tree, 14 words are incompatible (all1 arm beard break1 breast1 float2 free head one pour straight tear thousand1 young2). Ringe clearly wins.

Words connecting incompatible sub-groups are, for both trees, largely between Germanic and either Italic or Celtic. In Ringe’s tree this is 7 out of 14, whereas for mine it is 11 out of 16. As Ringe pointed out, this suggests some peculiarity of Germanic. However, the positioning of Germanic with Balto-Slavic on my tree eliminates 3 of Ringe’s incompatibles (arm float2 thousand1).

The major problem with my tree however, which lends weight to Ringe’s analysis of the position of Tocharian, is that two of my incompatibles (drink give1) are between Tocharian and Anatolian, and these could be eliminated by pairing Tocharian with Anatolian against the rest. Reordering this brings my incompatibles down to 14, of which 11 are still for Germanic with Italic or Celtic, and 2 of the remaining 3 are for 1 word.

Comparison of word and rule based trees

My modified Indo-European tree, based on the word evidence of Tocharian’s early separation before Italo-Celtic.

A Ringe/Gray&Atkinson composite Indo-European tree, based purely on the evidence of words.

Here are the diagrams of my modified morphology-based tree and Ringe’s purely word-based tree from Nakhleh et al. (which is much the same as Gray & Atkinson’s tree). In terms of word problems, this tree has just 9 words which don’t fit, compared to the 13 from my tree. However, as shown by the exclamation marks, at least three morphological criteria don’t fit, as well as three phonological criteria.

Apart from Anatolian separating first, the trees shown here are essentially different. Groups A, B, C and D do not occur in the lexical tree. Three morphologically exclusive sets, M5/G13, G10 and M8, are not obeyed by the lexical tree.

If you take the view that morphology trumps words then the tree on the left or my previous tree is more likely to be correct. If you agree with Garrett that words trump rules then (if you had to draw a tree) it would be more like that on the right. If you take the view of Ringe et al. (2002) then some kind of compromise tree is best.

So which, if any, is right?


A cladistic tree for Indo-European allowing late interconnection of dialects after their split.

Is this the best tree for Indo-European languages? In making some branches interact while still mutually intelligible, this forces Anatolian, Tocharian, Armenian and Greek to the edge, whilst allowing the remaining groups to interconnect.

I think that it’s essential to bring in geography at this point and to consider which language sub-groups are likely to have been closest to which others during history. In this case, the extremes of geographical location are of Tocharian in the East (within the Tarim Basin of modern Western China) and of Celtic in the West.

Notably, these two language sub-groups appear to be quite close morphologically. This means that morphological convergence is immensely unlikely for these two language sub-groups. Conversely, Germanic, Celtic and Italic were adjacent to each other at the beginning of the Iron Age and yet Germanic is about as far, morphologically, from Celtic and Italic as it’s possible to get.

Therefore it seems to me that Andrew Garrett must be wrong, at least in his conclusions about large-scale convergent morphology. On the other hand, Garrett arguably has a better case with phonology, as it’s difficult to reconcile the morphological evidence with the phonological evidence. Furthermore, Indo-Iranian was possibly in geographical proximity to Slavic and/or Baltic during the late Bronze or Iron Age, so satemisation may have occurred during contact.

The branching that appears to be happening on the word-based tree is, on this reading, a result of long-term geographical separation. Therefore the apparent grouping of Baltic, Slavic, Germanic, Italic and Celtic is a result of these language sub-groups coming into geographical proximity within Europe whilst still mutually intelligible but after dialect variations had occurred.

On the other hand, Anatolian and Tocharian and, to a lesser extent, Greek or Armenian, appear to have become separated from the rest of IE for long enough to make them unintelligible when IE languages came into contact with them again later in history.

Was this exercise useful? Probably not. Do I feel like I’m just re-inventing Ringe’s tree? A bit. However, I do think that it’s useful to see both the Ringe group tree and the Gray & Atkinson-type tree as both potentially useful in giving clues as to the prehistoric development of IE languages.


(Neditors note – You know I now realise that so much of the argument below depends on my belief that elite dominance is not a sufficient mechanism for language change. I prefer to see major intrusion of a new language carrying people as necessary (i.e. that 30% or more of the population will now be the new language carriers, probably in a dominant position). This is why I have a problem with the ‘full steppe’ model. However, if you don’t have this prejudice then the full steppe model is quite reasonable).

Does the new ancient genetic data put the homeland of Proto-Indo-European languages in the Black and Caspian Sea steppe or doesn’t it? Mostly, although the ancient Armenians are not entirely playing ball, and may mean that the steppe still isn’t the homeland.

Map of western Eurasia, showing the general view of the spread of Indo-European language families from a homeland in the steppe north of the Caucasus and Black Sea.

The steppe homeland for Indo-European languages, as argued by David Anthony for example. Is this model now proved right?

So… time to throw away my copy of ‘Archaeology & Language’ by Colin Renfrew and, probably, anything written about Indo-European by Bouckaert, Atkinson or Gray (I think most people did this long ago).

What has the European ancient autosomal DNA data coming out over the last three years shown me? It’s that I was spectacularly wrong about Proto-Indo-European (PIE). This is probably good for me. The linguists are certainly happy, having been proven basically right, and many archaeologists, notably David Anthony, are pretty happy too, although some are just deeply confused.

It makes you realise that although you think you’re thinking for yourself, actually what you’re doing is fitting into a mainstream view of archaeology which has prevailed since the 1970s – that people don’t move much and, at best, that they ‘culturally interact’.

So what we’re left with is one theory really, Marija Gimbutas’ Kurgan Hypothesis, which is that early IE languages were spoken in the southern steppes north of the Caucasus and Caspian Sea (aka the ‘Pontic-Caspian Steppe’), and that huge migrations of actual people smeared them across much of Eurasia between 3000 and 1000 BC. So far so good.

But there are still some things which are not resolved about the origins of Indo-European languages. The most obvious one comes out of this same genetic analysis, which shows that IE migrants from the steppe were the descendants of two sets of people: hunter-gatherers of the Russian steppe and farmers from somewhere around the Caucasus. Which of these supplied the language base for PIE is unknown. It may seem a technical point, seeing as the rest appears to be sewn up. However, it may mean that some modern IE languages didn’t originate in the steppe at all.

First, though, we should probably do a quick rerun of the main events now showing up from the genetic data.

The genetic prehistory of Europe in maps

Map of Europe/western Eurasia for 7000 BC, showing four main genetic groupings, WHG in west, EHG in northern Europe and the steppe, Anatolian and Levant Neolithic in the Middle East, and CHG/Iran Neolithic between the Black and Caspian Seas

Map of currently identified genetic groupings in western Eurasia/Europe around 7000 BC, before the commencement of farming across Europe, based on data listed in the references.

Five genetic populations are relevant in western Eurasia at the beginning of this story (around 7000 BC). We’ll start with the first two:

  • Western Hunter-Gatherers or WHG*: this genetic population cluster occupied much of southern and western Europe at this time. In the north and east they abutted…
  • Eastern Hunter-Gatherers or EHG*: this genetic population is a kind of hybrid between WHG and a population from the steppe known as Ancestral North Eurasians (ANE, currently represented by one much older ancient DNA sample known as Mal’ta from the Lake Baikal area).

*These labels are those used by researchers in ancient genetics for the genetic clusters which they’ve identified.

The boundary between WHG and EHG passed west to east through the Baltic region, dividing the Baltic states in the east before taking a southward turn to join the Black Sea at its western end. Populations either side of the boundary appear to be hybrids between WHG and EHG (e.g. Scandinavian Hunter-Gatherers or SHG and Ukraine_Mesolithic), although there may be other minor components in the Balkans.

In the south-east, there are three other populations:

  • Anatolian_Neolithic: this population, located in Anatolia, is important in spreading farming to all of the Balkans, western Europe and parts of the Ukraine between 7000 and 4000 BC. Unfortunately, the predecessors of AN in Anatolia have not yet been reported on. It is possible that it’s made up of a mix of earlier populations from the Balkans, Levant and Iran (due to their genetic similarity, on the diagram I’ve just lumped Anatolia and the Levant together in blue).
  • Iran_Neolithic: this population, found in NW Iran, shows some possible connection with modern south Asian populations.
  • Caucasus Hunter Gatherers or CHG: this population, found in the Caucasus of course, could be a mixture of Iran_Neolithic and mixed EHG/WHG populations, perhaps from the steppe, but also needs to include another, unknown population (NB I’ve lumped these last two related populations together as yellow as it was just becoming too messy with them separate).

These three populations appear to have interacted and mixed to some extent in the Middle East in the period between 7000 BC and 4000 BC.

 Western Eurasia/Europe around 4000 BC, with Early European farmers (largely descended from Anatolian Neolithic peoples) dominating much of Europe and mixing of EHG, CHG and Near Eastern neolithic populations between the Black Sea and Caspian Sea.

Western Eurasian/European genetics around 4000 BC, the probable time that PIE was spoken, showing the changed populations in Europe after the introduction of farming, as well as the mixing of populations near the Black and Caspian seas.

By 4000 BC, the Chalcolithic period and the beginning of the PIE window, the following changes seem to have happened to Western Eurasia:

  1. As a result of the spread of farming and people from Anatolia between 7000 and 4000 BC, most of Europe’s population has become a new, but relatively homogeneous group, known as EEF (Early European Farmers), which shows descent largely from AN, but with a considerable WHG component, perhaps varying between 10 and 30% (greater the further west). This includes Western Ukraine, where populations contain only perhaps 20% of the previous WHG/EHG mixed genetics.
  2. The eastern Baltic states show an increase in EHG ancestry at the expense of WHG, perhaps resulting from hunter-gatherer population replacement or movement from the northeast.
  3. The area to the north of the Caspian Sea, including Russia and the Eastern Ukraine, shows major genetic influx from the south, as Iran Chalcolithic type genes now dilute the previous EHG genes by between 40 and 50%, forming a new population which I’ve (now) labelled Iran/EHG hybrid, but which is called Samara_Eneolithic by geneticists. This hybridisation appears to have started around the middle of the 5th millennium BC and is possibly represented by David Anthony’s ‘late Khvalynsk’ culture.
  4. NE Anatolia/Southern Caucasus and NW Iran appear to have experienced a considerable influx of genes (perhaps 30%) from Anatolian (or even EEF) and Levantine populations coming from the south and east. This is offset slightly by a minor influx of Iranian and Anatolian genes to the Levant, and suggests continued mixing throughout the Near East.

Europe/Western Eurasia 3000 BC. This map is similar to that for 4000 BC, except that CHG/EHG hybrids (here called Yamnaya) have now expanded westward to reach the eastern Balkans and the edge of the Baltic.

Europe/Western Eurasia around 3000 BC, showing the expansion of Iran/EHG hybrid populations (now given the name ‘Yamnaya’) westward, and the expansion of CHG or Iranian populations further into Anatolia.

A thousand years later (3000 BC), now within the PIE (‘wheels and wool’) period, the following changes have happened:

  1. Influx of East Anatolian/Caucasian populations into the rest of Anatolia and to Greece (more than 10%).
  2. Influx (maybe 20%) of Pontic-Caspian steppe populations into the northern Balkans. This can be equated with Kurgans appearing in the Balkans.
  3. Massive influx or even replacement of Ukrainian and northern Baltic populations by the hybrid Iran/EHG population (now called Yamnaya by geneticists) from north of the Caspian Sea. This can be equated with the Yamna (aka Pit-Grave) archaeological horizon. The archaeology suggests that this expansion was quite rapid, sometime around 3300 BC, and many cultural changes occur across the Ukrainian steppe at this time.

Europe/Western Eurasia 2000 BC as before but Yamnaya populations have spread across northern Europe, with some admixture of EEF farmers, and further spread of CHG type populations into Anatolia.

Western Eurasia/Europe at about 2000 BC, showing the expansion (and slight dilution) of Yamnaya-type populations into northern Europe. Note the subtle, greeny-orange of apparent ‘backflow’ of Corded Ware people from Europe to the Urals and beyond.

During the next thousand years, up to 2000 BC, the following movements are seen:

  1. Around 2800 BC a massive influx (70% or more) of Yamnaya genes into the North European plain to produce Corded Ware populations (pots, not clothing) in East and Central Europe. This can be equated, funnily enough, with the ‘Corded Ware Horizon’. Further west, further expansions result in other population replacements by hybrid Central European populations (associated, to some degree, with the ‘Bell Beaker phenomenon’). This process is associated with the introduction of R1a and R1b1a1a Y-haplogroups to western Europe. Notably, Yamnaya-type genetics are also found in the Afanasievo population, far to the East in the Altai mountains, around 2700 BC, and this seems to be part of the same expansion.
  2. Bizarrely, at the end of the millennium there is a possible ‘back migration’ of the new, ‘Corded Ware’ type (hybrid Yamnaya and EEF) populations into the southern Urals in the Sintashta population. In the next couple of hundred years this population-type also spread further east, with the Andronovo population of the 2nd millennium BC Altai mountains showing the same genetics and possibly representing a swamping or replacement of previous Afanasievo populations here.
  3. Continued influx of NE Anatolian/Caucasian populations into Anatolia and Greece (finally between perhaps 20 and 50% depending on whether it originates in the Caucasus or NE Anatolia – this is probably associated with the introduction of J2a1 Y-haplogroups to the Aegean).
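The rough proportions quoted in the lists above can be chained together to see what they imply for a Corded Ware-type population. The Python sketch below does only that arithmetic. It assumes midpoints for the quoted ranges (Yamnaya as about 55% EHG and 45% Iran-type, EEF as about 80% AN and 20% WHG) and takes Corded Ware as 70% Yamnaya and 30% EEF, so the output is a back-of-envelope illustration and not a genetic result.

```python
def mix(components):
    """Blend ancestry profiles.

    components is a list of (weight, profile) pairs whose weights sum to 1;
    each profile maps a source population to its share of ancestry.
    """
    blended = {}
    for weight, profile in components:
        for source, share in profile.items():
            blended[source] = blended.get(source, 0.0) + weight * share
    return blended


# Assumed midpoints of the ranges quoted in the text.
yamnaya = {"EHG": 0.55, "Iran-type": 0.45}
eef = {"AN": 0.80, "WHG": 0.20}

# "70% or more" Yamnaya, with the remainder taken as EEF for simplicity.
corded_ware = mix([(0.70, yamnaya), (0.30, eef)])

for source, share in sorted(corded_ware.items()):
    print(f"{source}: {share:.1%}")
```

Changing the assumed midpoints within the quoted ranges shows how wide the uncertainty on each component is.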

After this restless period, the genetic data for the next thousand years is more limited and I haven’t drawn further maps. However, the following things are noticeable:

In Europe, north European populations are relatively genetically stable, but showing interbreeding, convergence and a slight increase in EHG/WHG type ancestry, suggesting either evolutionary advantage of these genes or, more likely, hidden populations at the margins of society which then intermix.

In the Mediterranean and the Balkans (including Greece), populations show gradual increases in ancestry related to the new ‘Corded Ware’ type (EHG/Iran/EEF mixed) populations of the North European plain and western steppe, presumably resulting from a steady influx of people from here. This is quite noticeable in Greece by about 1000 BC.

In the Middle East, there is continued mixing of populations across Anatolia, Iran and the Caucasus. However, this mixing is biased toward the genetics of populations of the Southern Caucasus/NE Anatolia (and, perhaps even East Asia?).

In the steppe, populations end up becoming more like the European ‘Corded Ware’ in the next millennium or so, with the disappearance of purely Yamnaya-type populations. However, these populations also show increases in East Asian genetic components by the Iron Age, these effects being more extreme further East.

It would be a fair guess that the Eurasian steppe, allowing movements of people between Northern Europe and the East, is a major factor in these later changes.


As discussed elsewhere, the evidence above effectively shows that the Yamnaya and Corded Ware horizons are very likely to be associated with the migration from the east to Europe of IE speakers. The Bell Beaker phenomenon is a little more complicated, but must have been associated with IE speakers in the NW of Europe at least.

The question is whether the earlier migrations out of the Caucasus and/or southern Caspian region, both into the steppe to the north and into Anatolia and Greece to the south, could also have included IE speakers. Here, I’ll discuss individual aspects that might help pin this down.

1: Timing and wheels

The wheel is generally considered not to have been invented until around 4000 BC (some say 3500 BC, but that seems a bit late from what I can tell). As most IE language families apart from Anatolian (e.g. Hittite) have essentially the same word for wheel, it’s generally taken that Anatolian must have split from the other IE language families around 4000 BC or a bit earlier, perhaps before the invention of the wheel. Other IE language families are thought to have separated after the invention of the wheel.

This is easy to accommodate if the homeland of PIE was in the steppe somewhere north of the Caucasus, as is the most common view. In this case, Anatolian split off first and went south (perhaps via the Balkans) sometime around 4000 BC, and the other languages split apart after the middle of the 4th millennium BC.

However if, alternatively, the Caucasus, NE Anatolia or NW Iran were argued to be a PIE homeland, is there evidence of the wheel before the movement of Caucasian/Iranian people north into the steppe? Frankly, it’s marginal, with a tendency toward ‘no’. The wheel would have been invented at about the same time as, or a little later than, the Caucasus/Caspian migration into the steppe.

If such a migration involved a movement of IE people north into the steppe and their isolation from their former IE neighbours to the south then, realistically, the Anatolian family would be the only IE family that could have been left behind in the south. All other IE language families would need to derive from the northern steppe IE peoples.

However, if the migration north took place after the separation of Anatolian, and simply involved a spread (e.g. a connected network of people around the Caspian Sea), then no separation need be involved. The Anatolian family, having split earlier than the rest (which could have happened south of the Caspian Sea or wherever), would have no related word for wheel. The remaining circum-Caspian (say) linguistic community could share in the introduction of the wheel and wheel words around 4000 BC or a little later, albeit with dialect variations (e.g. *kwékwlos, *kwukwlos, *kwokwlos variants in the PIE word for ‘wheel’ – see this for example).

Maybe this seems a stretch. However, it is not impossible (such an idea has been discussed for the extended Yamnaya steppe homeland before, e.g. by Benjamin Fortson). The reason why I mention all this is…

2: What about Ukraine circa 4000 BC?

The arrival of kurgans (steppe-type burials) in the Balkans has long been seen as a sign of a major incursion of steppe ‘Ukrainians’ into the region, perhaps bringing in Indo-European languages, around 4000 BC. Some archaeologists, notably David Anthony, have argued that this was the time when the Anatolian branch of PIE split off from the rest of PIE.

Genetics tells a little of this story, with a minor influx of steppe populations to the NE Balkans. IE language introduction would, on this basis, require ‘elite domination’ to change the languages of the Balkans to those of the Ukrainian steppe, a process which is difficult but not impossible.

However, the much bigger shift appears to happen in the other direction, from the Balkans into western Ukraine around this time, with a major influx of Balkan EEF-type populations into western Ukraine (presumably as a result of immigration by Tripolye farmers). Steppe migrants into the Balkans would very likely have become linguistically isolated, as predicted by David Anthony.

As for the Ukraine, whatever language the steppe peoples spoke here before 3300 BC is probably irrelevant. Ukraine’s population appears to have been largely replaced at about this time by people of the Yamnaya culture from the region north of the Caspian Sea (the hybrid Iran/EHG population). There is little or no evidence of continuity in the genetic data for Ukraine. These Yamnaya people, who brought kurgans to the Balkans, are very likely to have brought IE languages to the Balkans too.

This means that if people of the Ukrainian steppe were speaking a very early PIE language before 4000 BC, that language had only one slim chance to be preserved, and that is in the minor 4000 BC migrations of people into the Balkans and southward. Let’s see what that language’s chances were…

2: What about the Anatolian languages?

The Mathieson et al. paper (in review as of 2017), currently circulating in various forms, refutes one particular argument of David Anthony (2007) and others, which is that there was much migration from the steppe into Anatolia between 4000 and 2000 BC. This has been further backed up by Lazaridis et al. 2017.

This means that any migration to the Balkans around 4000 BC is unlikely to have affected Anatolia and, therefore, that Anatolian IE languages are unlikely to have got to Anatolia via the Balkan route. Any potential early PIE languages coming southwest from the Ukraine are therefore likely to have got stuck in the Balkans. We have no evidence for any such language as all well attested IE languages of the Balkans appear to be from the later migrations (Yamnaya or even later).

Instead, the evidence of an increase in genetic contribution from the Caucasus (or, less likely, Iran) suggests migration from the East into Anatolia during this period.

What this tells us about Anatolian languages is difficult to say. As Mathieson et al. state, the sampling in Anatolia is not extensive, and maybe they’ve just been unlucky in not sampling the right ancient people in Anatolia. However, there is generally quite a lot of consistency in their samples for different areas, so this seems questionable.

This leaves two theories for the Anatolian languages. The first is that they are home grown, as Colin Renfrew argued. Realistically, the likelihood of this is low, based on linguistic evidence of language replacement by Anatolian languages (oh how blind I was). The other is that Anatolian languages originated somewhere near or in the Caucasus (or Iran).

3: What about Armenian?

The Armenian language is a similar problem. The genetics of Armenia is largely non-steppe and appears to have been so since at least the 5th millennium BC, being mostly a mix of CHG and Anatolians/EEF. Since then genetic change in the area has been gently toward Iran, Anatolia and the Middle East. In fact, unlike northern Europeans, Armenians have not changed that much genetically in the last 6000 years. There is no particular evidence for a major immigration event during this time (there have been changes, but the major ones appear to be between CHG and Iranian influences).

I should mention the presence of ancient Y-haplogroup R1b1a1 in Armenia in an individual of the 3rd millennium BC, and of R1b1a1a and sub-clades from the 2nd millennium BC and 1st millennium BC. The first is ambiguous and could be due to male intrusion into the area of modern Armenia from the west or the steppe (more likely the steppe). The others are clearly due to steppe intrusion. What numbers of male individuals are implicated and on how many occasions is difficult to say, but it could not have been large (about 20% – see Eurogenes for a good summary here).

Whilst the Armenian language is not recorded in ancient texts (its earliest record is the 5th century AD) it appears to have been knocking around in its present area since at least the 1st millennium BC based on the evidence of loanwords into neighbouring Iron Age languages. Coupled with the genetic info, this means that either the precursors of Armenian have been in NE Anatolia since the 5th millennium BC or a small elite managed to change the language of this region before the 1st millennium BC, something which, as with the Anatolian languages, is quite hard to do.

In combination this makes a steppe origin for the Armenian language, arriving perhaps in 3rd millennium BC, possible but not very easy.

4: What about Greek?

Greece’s prehistory can be read in two ways. Greek as a language is clearly present in the Peloponnese by the middle of the 2nd millennium BC (as evidenced by Linear B tablets). However, the genetics of Greece before around 3500 BC appear to be very much like other EEF populations or Anatolians, which may mean that it wasn’t an IE language that they were speaking then. This is backed up by the evidence of a ‘language substrate’, sometimes called Pre-Greek, in Greek language and geography. Therefore Greek was probably introduced at some time between about 3500 and 1500 BC.

Greece’s genetic drift between 3500 and 1500 BC seems to fall in a varying path between initially moving toward the Caucasus/NE Anatolia, and later toward the new steppe populations of Europe. It’s a reasonable guess that Proto-Greek languages could have come to Greece after 3500 BC either from the Caucasus/NE Anatolia or from Yamnaya/Corded-Ware migrants in the Balkans.

However, Greek for various reasons is generally bundled by linguists with Armenian (and to a lesser extent Indo-Iranian). If the ancestors of Armenian really have been stuck in NE Anatolia since the 5th millennium BC and there is a connection between Greek and Armenian, doesn’t that suggest that the precursors of Greek might have been there with them? This would favour a Caucasus/NE Anatolian origin for Greek.

Whatever, there’s still that (admittedly small) Yamnaya/Corded Ware component in Greek mainland populations of the second millennium BC. Notably, this is not seen in the populations of Crete (the ‘Minoans’), who appear to have spoken an unknown language that was not Greek.

What about Indo-Iranian?

I’m simply not sure that we have enough data to say much at present.

Linguistic evidence first comes from Indo-Iranian (particularly Old Indic) loanwords and names in northern Syria and Anatolia from the middle of the 2nd millennium BC. These seem to be the result of minor intrusion of elite groups perhaps from the east or north. However, as is often the case, these elites had adopted local languages within a few hundred years. Secondly, by the first millennium BC people in the Pontic-Caspian steppe were speaking a form of Iranian language. Such a language was also being spoken in Iran by the middle of the millennium.

On the other hand, many others, notably Elena Kuz’mina, associate Indo-Iranians with the Sintashta culture of the 2000 BC southern Urals and, by extension, with the Andronovo phenomenon further east. As these cultures appear to have genetics more similar to those of the European Corded Ware than to the Yamnaya, this would mean that Indo-Iranian comes from northern Europe (oh, how those old German philologists would laugh).

For an IE language to come from Europe at this time is quite possible, as IE languages probably dominated the Corded Ware culture. However, it needs to be pointed out that the association of Indo-Iranians with any of this is not confirmed, being based on ritual similarities of the Andronovo with those mentioned in the Indian Rig Veda. Such similarities could be cultural and not linguistic.

Frankly, we’ll need more data to come from Indian and Iranian ancient DNA to answer whether there is a connection. Personally, I can’t help noticing the lack of a dominant Iran+EHG (steppish) hybrid signal in modern north Indians, Iranians and Afghans (i.e. people who speak Indo-Iranian languages) but plenty of evidence for a CHG or Iranian/Anatolian-type signal. This suggests more Middle Eastern than steppe input is now present here. But of course, populations change gradually over time so this may mean nothing.

As for NW Iran, the genetic data we have here currently suggest that the region was genetically converging with Anatolia up until around 4000 BC, after which date the ancient genetic evidence is missing. Between this time and the present Iran appears to show evidence of mild genetic influx from the steppe and, perhaps, the Caucasus. However, the details are sketchy. Whether this influx is enough to change the language to one from the steppe is open to debate.

What about Tocharian?

Tocharian is associated in many people’s minds with the Xiaohe mummies of the 2nd and 1st millennia BC Tarim Basin, China. It is also often associated with the Afanasievo culture of the Altai/Yenisei region of Russia, dated to the middle of the 3rd millennium BC. Afanasievo is found about seven hundred miles north of the Tarim Basin so a migration from the Altai to the Tarim during the 3rd or 2nd millennium BC is envisaged. It should be mentioned that neither of these associations is necessarily true. In fact, one of their main advocates, James Mallory, is starting to have doubts.

When David Anthony wrote ‘The Horse, Wheel and Language’, his estimated dates for the separation of the Afanasievo culture from other Pontic-Caspian cultures were around the middle of the fourth millennium BC. As Tocharian is often considered (e.g. by Don Ringe) to be the second earliest IE family to separate (after Anatolian), this fitted nicely with Tocharian being the Afanasievo culture.

This is now more complicated as the Afanasievo culture dates to much the same time as the Corded Ware expansion, around 2800 BC, and appears to be its eastern arm. If Tocharian was part of this spread then it would show linguistic traits that are no older than other IE languages (such as, say, Celtic). In fact not everyone agrees that Tocharian is so early in branching, and some linguists associate it with the Germanic family (for an excellent review of all this see Mallory 2015, e.g. p33). Alternatively, the linguists Gamkrelidze & Ivanov (who I’ll mention again in a moment), associate it with Italic and Celtic (currently, I think it probably is early to branch).

These last ideas make possible an alternative which I don’t think will be popular with Elena Kuz’mina or, probably, anyone.

What if Tocharian is not associated with the Afanasievo culture at all, but should instead be associated with its successor, the Andronovo phenomenon (and hence the Sintashta culture of the Urals)? This culture, essentially from the early second millennium BC, is generally accepted to be Indo-Iranian, so I won’t push this. However, it would at least explain Tocharian’s possible association with Germanic or Italo-Celtic, as Sintashta appears to be the result of back-migration of people from Corded Ware Europe, and Italo-Celtic at least is generally thought to have gone west with the Corded Ware migrations.

This also has the advantage of removing a difficult time-gap between the Afanasievo culture and the arrival of European-looking mummies (well, arrival before they were mummies, obviously) in the Tarim Basin. If the Andronovo spread carried Tocharian to the Hindu Kush, say, in the early 2nd millennium BC, then the Tarim Mummies could be the immediate, Tocharian successors of this spread.

This is all wild speculation and probably wrong. The only test I can suggest for this is that since Afanasievo people, like Yamnaya, were an Iran/EHG hybrid, whereas Andronovo showed the presence of EEF genes, looking for an EEF signature in the Tarim mummies may help in narrowing the field, at least between these two cultures.

Don’t forget Gamkrelidze & Ivanov

Gamkrelidze & Ivanov's version of PIE spread, showing Italic, Celtic, Germanic and Balto-Slavic (the western IE group) and Tocharian spreading from the Pontic-Caspian steppe (IE's secondary homeland), whereas Anatolian, Greek, Armenian and Indo-Iranian spread from south of the Caucasus.

Something approximating Gamkrelidze & Ivanov’s model of PIE homelands and spread.

Back in the 1980s two linguists, a Georgian, Tamaz Gamkrelidze, and a Russian, Vyacheslav Ivanov, made the case that the Proto-Indo-European homeland was in eastern Anatolia or NW Iran. Their model has the western dialects of Indo-European (Celtic, Italic, (???)Balto-Slavic and (???)Germanic) as well as Tocharian, what I’ll call North Caspian IE, coming from the steppe. However the precursors of North Caspian IE are argued to have come from a homeland south of the Caucasus.

Excluding Anatolian, which had already split off at an earlier date, the South Caspian IE languages, which remained south of the Caucasus, were argued to have spread west (in the case of Greek), stayed put (in the case of Armenian), or spread east below the Caspian Sea, through Iran into India and north into the steppe (in the case of Indo-Iranian).

(NB Albanian was argued to be a hybrid between a South Caspian language and a North Caspian one, formed around 1200 BC. However, Albanian, whilst a fine language I’m sure, is such a frazzled stub of IE that it’s very difficult to say more than that it might be less closely related to Balto-Slavic, Iranian, Greek, Armenian or Germanic than to Italic, Celtic, Tocharian or Anatolian).

Gamkrelidze & Ivanov’s case is based on many small points of linguistic detail, and has been easy to refute based on timing, notably by Bill Darden. Much of this depends on the separation of North Caspian IE from South Caspian IE after the invention of the wheel sometime around 4000-3500 BC. If wheels were invented after the split, how come they have the same basic word for wheel? Is an extended circum-Caspian late PIE, including both North Caspian and South Caspian IE dialects, possible? It only has to be until the early fourth millennium, when the wheel was invented, so for a few hundred years at most.

Either way, based on the genetics it looks to me like such a model is possible, if not wholly probable. Say it turns out that, of all things, the heartland of PIE lay south of the Caucasus near the Caspian Sea, then conservative Anatolian languages could have spread slightly westward from here at some time in the 5th millennium as a result of population movements from here. A bit later or even at the same time, expansion around or via the Caspian Sea north into the Caspian steppe could have allowed IE languages to be extended into the steppe as part of a circum-Caspian late PIE connected with the late Khvalynsk culture.

In this scenario much of the story in the north is not dissimilar to that told by David Anthony. North Caspian IE, in the form of Yamnaya, would still expand west into the Ukraine and edge into the eastern Balkans in the late 4th millennium BC. The great explosion west would be with the Corded Ware culture of the early 3rd millennium BC, where some Yamnaya peoples, perhaps based in the northeastern Balkans, would spread rapidly across Northern Europe, introducing North Caspian IE languages (?including Phrygian). The story of the spread of Tocharian, whether with Afanasievo around 2700 BC or with Andronovo around 1900 BC, would have to be revised of course.

However, a related South Caspian IE would be the ancestor of Greek, Phrygian (no, not Phrygian) and Armenian. In this scenario Greek and Phrygian would spread west in the 3rd or 2nd millennia BC, perhaps carried by sailors from the lugubrious waters of the Black Sea into the Aegean. Whether South Caspian IE would also be the origin of Indo-Iranian, as opposed to Sintashta cultures, is even more speculative but, in this scenario, quite possible.

Is this crazy? Probably. But at least one set of linguists seemed to have the same kind of idea… which is, sadly, more than can be said for Colin Renfrew’s ‘Archaeology & Language’.



