Thursday, July 20, 2017

An Algorithm For Counting Syllables

A while ago I was talking to a friend who works for a startup company doing natural-language processing. At some point we got to talking about syllabification, and I mentioned a syllable-counting algorithm I'd made ten years ago. He seemed to think that there weren't many such things floating around, so I thought I'd post the algorithm here. It's kind of interesting even if you aren't in the syllable-counting industry.

The algorithm is described below, followed by the code. Given a word as a string of uppercase A–Z letters, the algorithm returns a number which is likely to be the number of syllables in the word. The algorithm doesn't tell you where the syllable breaks should be; it just guesses how many syllables there are.

How good is the algorithm? I don't have much data on that, but it seems to work well enough for my purposes (making word games). The algorithm gets 93 of the following 100 words correct (the number next to each word is the count the algorithm returns):

PYRAMIDALLY 5
MISERICORDS 4
THYROXINS 3
PLICATION 3
PERTINACIOUS 5
LEAL 1
LYCOPOD 3
FRIPPERIES 3
INTERDEPENDENCY 6
RITORNELLOS 4
PRESSORS 2
TINNY 2
POOFTER 2
PETTLED 2
BACKPEDALLING 4
CHERRYLIKE 3
INTERCONVERT 4
VILLEIN 2
STALEST 2
PREVISION 3
TERRITORIALITY 7
ALKYLS 2
WAMBLED 2
TRILINGUAL 3
MONKSHOOD 2
UNSCREWED 2
SCOTIAS 3
SYMPHONY 3
DREADFULS 2
FOCALIZATION 5
CONTRARIETY 4
HABERDASHER 4
UNCURIOUS 4
UNSWORE 2
MULTIVOLUME 4
CHEF 1
PITH 1
PENTAMIDINE 4
VISITATIONS 4
STEEPEN 2
SYNCOPATING 4
INTRIGANTS 3
SCANDIAS 3
BREGMATE 2
INACTIVATION 5
UNDERUTILIZES 6
FLOC 1
TUBAE 2
UNSYSTEMATIZED 5
HELLENIZE 3
VEXT 1
MUDSLINGERS 3
PHILANTHROPISTS 4
TALLISH 2
DOWNER 2
HODAD 2
OCHERS 2
ENFOLDING 3
CATWALKS 2
RAMPANT 2
PREORDAINMENT 4
TROUSSEAU 2
GRAVIDA 3
POLYWATER 4
INSUFFLATOR 4
MODERATENESSES 6
OUTCROWING 3
ACCOUCHEMENTS 4
UNDERDOGS 3
BRAIZE 1
VICEROYSHIP 4
CACHEXIAS 4
CUDDLIER 3
ASSIGNOR 3
SUEDED 2
TACTILITIES 4
SPHENOIDS 2
BANKROLLERS 3
OCTANTS 2
ANNUNCIATORY 6
BLACKLEAD 2
TITLEHOLDERS 4
REPRESSOR 3
RECKONERS 3
FAVOURS 2
OSAR 2
EDUCATION 4
UNCREATES 2
MEASURABILITIES 6
AROUSERS 3
DISASSEMBLY 4
LICHEE 2
MOTIVATOR 4
CONCEDEDLY 4
VELLUM 2
MERCHANDIZE 3
CONCOURSES 3
PHILISTIA 4
COUNTERINSURGENCY 6
DUMBWAITERS 3

Here are 34 words that I used while developing the algorithm—they are tricky cases or else illustrate some of the exceptions being handled. The algorithm gets 27 of them correct.

AIDE 1
IDEA 3
IDEAS 2
IDEE 2
IDE 1
AIDA 2
PROUSTIAN 3
CHRISTIAN 3
CLICHE 1
HALIDE 2
TELEPHONE 3
TELEPHONY 4
DUE 1
IDEAL 2
DEE 1
UREA 3
VACUO 3
SEANCE 1
SAILED 1
RIBBED 1
MOPED 1
BLESSED 1
AGED 1
TOTED 2
WARRED 1
UNDERFED 2
JADED 2
INBRED 2
BRED 1
RED 1
STATES 1
TASTES 1
TESTES 1
UTILIZES 4

Some of the failures are interesting, because when you see the number of syllables assigned by the computer, you get a sense of how the computer is "misreading" the words. For example, cliché gets one syllable, so from the point of view of the algorithm, it is as if CLEE-SHAY were being read as CLEESH. (My dictionary doesn't include diacritic marks.) Some of the mispronunciations aren't far off the mark, in terms of patterns of pronunciation in English: testes gets one syllable, as if it were pronounced analogously to tastes; pertinacious gets 5 syllables, as if it were pronounced PER-TIN-A-SEE-US.

I'm surprised to see that the algorithm gets idea right but ideas wrong. I remember that an early version of the algorithm dropped terminal S's before applying the logic, but maybe that rule was worse, on balance. Clearly the algorithm could be improved in any number of ways, for example by adding more exception handling.
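One simple form of exception handling is a lookup table that is consulted before the rules run. The following is only a sketch under assumptions: SyllableExceptions and CountSyllablesWithExceptions are hypothetical names that don't appear in the original code, the three entries are just the misses discussed above, and CountSyllables is the function defined further down in this post.

```mathematica
(* Hypothetical override table: words the rules miss, with the intended counts. *)
SyllableExceptions=<|"CLICHE"->2,"IDEAS"->3,"TESTES"->2|>;

(* Use the table entry if there is one; otherwise fall back to the rule-based count. *)
CountSyllablesWithExceptions[WordArg_String]:=
Lookup[SyllableExceptions,WordArg,CountSyllables[WordArg]];
```

With this in place, CountSyllablesWithExceptions["CLICHE"] gives 2, while a word that isn't in the table, such as "PITH", still goes through the rules and gets 1.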

***

Here is the description from the comments to the code:

(*

Algorithm for counting syllables.


If the word has no consonants, then the number of syllables is the number of distinct letters in the word. Otherwise:

We count the 'leading' syllables by advancing through the word from left to right, examining pairs of consecutive letters.  If the pair is (vowel)(consonant), then we increment the syllable count by 1. Otherwise, we do nothing.

We next add syllables for each appearance of the vowel combination IA (as in EDITORIAL).

We next add syllables for each appearance of the vowel combination EO (as in PREORDAINED).

We add a syllable if the word ends in -IOUS or -IER.

We next correct unwanted syllabification of (consonant+ED) endings.  We would be satisfied with giving certain ED endings their own syllable (BLESSED), but we don't want all ED endings to get a syllable.  The rule adopted here is that we deduct a syllable for an ED ending unless the prior letter is D, T, or ((consonant other than R)+R) or (consonant+L).

We next correct unwanted syllabification of ES endings.  The rule adopted here is that we deduct a syllable for an ES ending unless the prior letter is C, G, X, S, Z, or (consonant+L).

We now add 'final' syllables if necessary.  If the last letter is a consonant, or if the second-last letter is a consonant and the last letter is an E, then there are no final syllables. Otherwise, if the final string of 1 or more vowels matches one of a specified set of disyllabic pairs, then there are two final syllables; otherwise, there is one final syllable.

Finally, if the syllable count is less than 1, set the syllable count to 1.

The disyllabic combinations are specified to be EA, II, IO, UA, UO. (We don't have to include IA and EO because they have been handled above.)

*)
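To see the steps in action, here is a hand trace of two words from the lists above, with the first step checked by a one-liner. The helper name LeadingPairs and the regular expression are assumptions made for illustration: the expression is a shorthand for (vowel)(consonant), with Y counted as a vowel as in the implementation below, and it is not how the original code does the counting.

```mathematica
(* Step 1 only: count (vowel)(consonant) pairs. Y is treated as a vowel. *)
LeadingPairs[WordArg_String]:=
StringCount[WordArg,RegularExpression["[AEIOUY][B-DF-HJ-NP-TV-XZ]"]];

LeadingPairs["TELEPHONY"]   (* 3: EL, EP, ON *)
LeadingPairs["TASTES"]      (* 2: AS, ES *)

(* TELEPHONY: 3 leading; no IA, EO, -IOUS or -IER; no ED or ES ending;
   the final vowel chunk Y adds 1; total 4.
   TASTES: 2 leading; the ES ending follows T, so deduct 1;
   the last letter is a consonant, so no final syllable; total 1. *)
```

Both totals agree with the counts listed for those words.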

***

Here is a Mathematica implementation. It's ugly, in part because it's from ten years ago but mostly because I tend to code a project like this just about as fast as I can type, not worrying about planning or elegance.

Alphabet={"A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"};

Vowels={"A","E","I","O","U","Y"};

Consonants=Complement[Alphabet,Vowels];

VowelPositionTFList[LetterListArg_]:=
Table[
MemberQ[Vowels,LetterListArg[[k]]],
{k,1,Length[LetterListArg]}];

ConsonantPositionTFList[LetterListArg_]:=
Map[Not,VowelPositionTFList[LetterListArg]];

LastConsonantPosition[LetterListArg_]:=
Max[
Position[
ConsonantPositionTFList[LetterListArg],
True
]
];

CountVowels[LetterListArg_]:=
Count[
VowelPositionTFList[LetterListArg],
True
];

DisyllabicVowelPairs={{"E","A"},(* {"E","O"}, {"I","A"},*) {"I","I"},{"I","O"},{"U","A"},{"U","O"}};

CountSyllables[WordArg_]:=
{
  CharacterList=Characters[WordArg];
  NCharacters=Length[CharacterList];
  If[
    CountVowels[CharacterList]==NCharacters,
    (* No consonants: one syllable per distinct letter. *)
    Length[Union[CharacterList]],
    (* Leading syllables: one for each (vowel)(consonant) pair. *)
    RawSyllableCount=0;
    Do[
      TempCharacter1=CharacterList[[LetterIndex]];
      TempCharacter2=CharacterList[[LetterIndex+1]];
      If[MemberQ[Vowels,TempCharacter1]&&MemberQ[Consonants,TempCharacter2],
        RawSyllableCount=RawSyllableCount+1
      ],
      {LetterIndex,1,NCharacters-1}];
    CurrentSyllableCount=RawSyllableCount;
    (* One more for each IA and each EO. *)
    CurrentSyllableCount=CurrentSyllableCount+Length[StringPosition[WordArg,"IA"]];
    CurrentSyllableCount=CurrentSyllableCount+Length[StringPosition[WordArg,"EO"]];
    (* One more for an -IOUS or -IER ending. *)
    If[
      NCharacters>4&&StringTake[WordArg,-4]=="IOUS",
      CurrentSyllableCount=CurrentSyllableCount+1
    ];
    If[
      NCharacters>3&&StringTake[WordArg,-3]=="IER",
      CurrentSyllableCount=CurrentSyllableCount+1
    ];
    (* (consonant)ED ending: deduct 1, except after D or T, after (consonant)L,
       or after R preceded by a letter other than R. Note that this last test
       accepts any letter other than R, not only a consonant as in the description. *)
    If[
      NCharacters>2&&CharacterList[[-1]]=="D"&&CharacterList[[-2]]=="E"&&MemberQ[Consonants,CharacterList[[-3]]],
      EDCorrection=-1;
      If[
        MemberQ[{"D","T"},CharacterList[[-3]]],
        EDCorrection=0,
        If[
          ((NCharacters>3)&&CharacterList[[-3]]=="R"&&CharacterList[[-4]]!="R")||((NCharacters>3)&&CharacterList[[-3]]=="L"&&MemberQ[Consonants,CharacterList[[-4]]]),
          EDCorrection=0
        ]
      ],
      EDCorrection=0
    ];
    CurrentSyllableCount=CurrentSyllableCount+EDCorrection;
    (* (consonant)ES ending: deduct 1, except after C, G, X, S, Z, or (consonant)L. *)
    If[
      NCharacters>2&&CharacterList[[-1]]=="S"&&CharacterList[[-2]]=="E"&&MemberQ[Consonants,CharacterList[[-3]]],
      ESCorrection=-1;
      If[
        MemberQ[{"C","G","X","S","Z"},CharacterList[[-3]]],
        ESCorrection=0,
        If[
          ((NCharacters>3)&&CharacterList[[-3]]=="L"&&MemberQ[Consonants,CharacterList[[-4]]]),
          ESCorrection=0
        ]
      ],
      ESCorrection=0
    ];
    CurrentSyllableCount=CurrentSyllableCount+ESCorrection;
    (* Final syllables: none if the word ends in a consonant or in (consonant)E;
       otherwise 2 if the trailing vowels form a disyllabic pair, else 1. *)
    TempPosition=LastConsonantPosition[CharacterList];
    If[
      (TempPosition==NCharacters)||((TempPosition==NCharacters-1)&&(CharacterList[[-1]]=="E")),
      FinalSyllableCount=0,
      VowelChunk=Take[CharacterList,{TempPosition+1,NCharacters}];
      FinalSyllableCount=
        If[
          MemberQ[DisyllabicVowelPairs,VowelChunk],
          2,
          1
        ]
    ];
    CurrentSyllableCount=CurrentSyllableCount+FinalSyllableCount;
    (* Never report fewer than one syllable. *)
    Max[CurrentSyllableCount,1]
  ]
}[[1]];
(* The braces wrap the whole computation in a one-element list; [[1]] extracts its value. *)
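Once the definitions above have been evaluated, the function can be spot-checked and scored. This is a minimal sketch rather than part of the original code: ScoreAgainst is a hypothetical helper, and because the numbers listed earlier in this post are the algorithm's own output, the true counts to score against have to come from somewhere else (here, three hand-entered pairs).

```mathematica
(* Spot checks against the counts listed earlier in the post. *)
CountSyllables/@{"IDEA","IDEAS","TELEPHONY","TESTES"}
(* {3, 2, 4, 1} *)

(* Hypothetical scoring helper: PairsArg is a list of {word, true syllable count};
   the result is the number of words the algorithm gets right. *)
ScoreAgainst[PairsArg_List]:=
Count[PairsArg,{w_String,n_Integer}/;CountSyllables[w]==n];

ScoreAgainst[{{"PITH",1},{"TINNY",2},{"IDEAS",3}}]
(* 2: PITH and TINNY match, while IDEAS comes out as 2 *)
```

Input has to be uppercase, since the vowel and consonant lists contain only capital letters.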
