The algorithm is described below, followed by the code. Given a word as a string of A–Z letters, the algorithm returns a number which is likely to be the number of syllables in the word. It doesn't tell you where the syllable breaks should be; it just guesses how many syllables there are. Here is its output for 100 words:

PYRAMIDALLY 5

MISERICORDS 4

THYROXINS 3

PLICATION 3

PERTINACIOUS 5

LEAL 1

LYCOPOD 3

FRIPPERIES 3

INTERDEPENDENCY 6

RITORNELLOS 4

PRESSORS 2

TINNY 2

POOFTER 2

PETTLED 2

BACKPEDALLING 4

CHERRYLIKE 3

INTERCONVERT 4

VILLEIN 2

STALEST 2

PREVISION 3

TERRITORIALITY 7

ALKYLS 2

WAMBLED 2

TRILINGUAL 3

MONKSHOOD 2

UNSCREWED 2

SCOTIAS 3

SYMPHONY 3

DREADFULS 2

FOCALIZATION 5

CONTRARIETY 4

HABERDASHER 4

UNCURIOUS 4

UNSWORE 2

MULTIVOLUME 4

CHEF 1

PITH 1

PENTAMIDINE 4

VISITATIONS 4

STEEPEN 2

SYNCOPATING 4

INTRIGANTS 3

SCANDIAS 3

BREGMATE 2

INACTIVATION 5

UNDERUTILIZES 6

FLOC 1

TUBAE 2

UNSYSTEMATIZED 5

HELLENIZE 3

VEXT 1

MUDSLINGERS 3

PHILANTHROPISTS 4

TALLISH 2

DOWNER 2

HODAD 2

OCHERS 2

ENFOLDING 3

CATWALKS 2

RAMPANT 2

PREORDAINMENT 4

TROUSSEAU 2

GRAVIDA 3

POLYWATER 4

INSUFFLATOR 4

MODERATENESSES 6

OUTCROWING 3

ACCOUCHEMENTS 4

UNDERDOGS 3

BRAIZE 1

VICEROYSHIP 4

CACHEXIAS 4

CUDDLIER 3

ASSIGNOR 3

SUEDED 2

TACTILITIES 4

SPHENOIDS 2

BANKROLLERS 3

OCTANTS 2

ANNUNCIATORY 6

BLACKLEAD 2

TITLEHOLDERS 4

REPRESSOR 3

RECKONERS 3

FAVOURS 2

OSAR 2

EDUCATION 4

UNCREATES 2

MEASURABILITIES 6

AROUSERS 3

DISASSEMBLY 4

LICHEE 2

MOTIVATOR 4

CONCEDEDLY 4

VELLUM 2

MERCHANDIZE 3

CONCOURSES 3

PHILISTIA 4

COUNTERINSURGENCY 6

DUMBWAITERS 3

Here are 34 words that I used while developing the algorithm—they are tricky cases or else illustrate some of the exceptions being handled. The algorithm gets 27 of them correct.

AIDE 1

IDEA 3

IDEAS 2

IDEE 2

IDE 1

AIDA 2

PROUSTIAN 3

CHRISTIAN 3

CLICHE 1

HALIDE 2

TELEPHONE 3

TELEPHONY 4

DUE 1

IDEAL 2

DEE 1

UREA 3

VACUO 3

SEANCE 1

SAILED 1

RIBBED 1

MOPED 1

BLESSED 1

AGED 1

TOTED 2

WARRED 1

UNDERFED 2

JADED 2

INBRED 2

BRED 1

RED 1

STATES 1

TASTES 1

TESTES 1

UTILIZES 4

*cliché* gets one syllable, so from the point of view of the algorithm, it is as if CLEE-SHAY were being read as CLEESH. (My dictionary doesn't include diacritic marks.) Some of the mispronunciations aren't far off the mark, in terms of patterns of pronunciation in English: *testes* gets one syllable, as if it were pronounced analogously to *tastes*; *pertinacious* gets 5 syllables, as if it were pronounced PER-TIN-A-SEE-US.

I'm surprised to see that the algorithm gets *idea* right but *ideas* wrong. I remember that an early version of the algorithm dropped terminal S's before applying the logic, but maybe that rule was worse, on balance. Clearly the algorithm could be improved in any number of ways, for example by adding more exception handling.

***

Here is the description from the comments to the code:

(*

Algorithm for counting syllables.

If the word has no consonants, then the number of syllables is the number of distinct letters in the word. Otherwise:

We count the 'leading' syllables by advancing through the word from left to right, examining pairs of consecutive letters. If the pair is (vowel)(consonant), then we increment the syllable count by 1. Otherwise, we do nothing.

We next add a syllable for each appearance of the vowel combination IA (as in EDITORIAL).

We next add a syllable for each appearance of the vowel combination EO (as in PREORDAINED).

We add a syllable if the word ends in -IOUS or -IER.

We next correct unwanted syllabification of (consonant+ED) endings. We would be satisfied with giving certain ED endings their own syllable (BLESSED), but we don't want all ED endings to get a syllable. The rule adopted here is that we deduct a syllable for an ED ending unless the prior letter is D, T, or ((consonant other than R)+R) or (consonant+L).

We next correct unwanted syllabification of ES endings. The rule adopted here is that we deduct a syllable for an ES ending unless the prior letter is C, G, X, S, Z, or (consonant+L).

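We next count the final syllables. If the word ends in a consonant, or in a consonant followed by E, we add nothing. Otherwise we take the run of vowels after the last consonant and add 2 if it is one of the disyllabic combinations, or 1 if it is not. The disyllabic combinations are specified to be EA, II, IO, UA, UO. (We don't have to include IA and EO because they have been handled above.)

Finally, if the syllable count is less than 1, set the syllable count to 1.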

*)
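To make the first step concrete, here is a minimal sketch of the leading count on its own. It is not the implementation given below, just one way to express "count the (vowel)(consonant) pairs"; the names `vowels` and `leadingSyllables` are invented for this illustration, and the input is assumed to be an uppercase A–Z string.

```mathematica
(* Illustrative names only; not part of the original code. *)
vowels = {"A", "E", "I", "O", "U", "Y"};

(* Count adjacent (vowel)(consonant) pairs, scanning left to right. *)
leadingSyllables[word_String] :=
  Count[
    Partition[Characters[word], 2, 1],
    {v_, c_} /; MemberQ[vowels, v] && ! MemberQ[vowels, c]
  ];

leadingSyllables /@ {"TELEPHONE", "IDEA", "STATES"}
(* {3, 1, 2} *)
```

The later steps account for the rest: IDEA picks up 2 more from its trailing EA, and STATES loses 1 to the ES rule, which gives the 3 and 1 shown in the list of tricky cases.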

***

Here is a Mathematica implementation. It's ugly, in part because it's from ten years ago but mostly because I tend to code a project like this just about as fast as I can type, not worrying about planning or elegance.

    Alphabet={"A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"};

    Vowels={"A","E","I","O","U","Y"};

    Consonants=Complement[Alphabet,Vowels];

    VowelPositionTFList[LetterListArg_]:=
      Table[
        MemberQ[Vowels,LetterListArg[[k]]],
        {k,1,Length[LetterListArg]}];

    ConsonantPositionTFList[LetterListArg_]:=
      Map[Not,VowelPositionTFList[LetterListArg]];

    LastConsonantPosition[LetterListArg_]:=
      Max[
        Position[
          ConsonantPositionTFList[LetterListArg],
          True
        ]
      ];

    CountVowels[LetterListArg_]:=
      Count[
        VowelPositionTFList[LetterListArg],
        True
      ];

    DisyllabicVowelPairs={{"E","A"},(* {"E","O"}, {"I","A"},*) {"I","I"},{"I","O"},{"U","A"},{"U","O"}};

    CountSyllables[WordArg_]:=
    {
      CharacterList=Characters[WordArg];
      NCharacters=Length[CharacterList];
      If[
        CountVowels[CharacterList]==NCharacters,
        (* No consonants: count the distinct letters. *)
        Length[Union[CharacterList]],
        (* Leading syllables: one for each (vowel)(consonant) pair. *)
        RawSyllableCount=0;
        Do[
          TempCharacter1=CharacterList[[LetterIndex]];
          TempCharacter2=CharacterList[[LetterIndex+1]];
          If[MemberQ[Vowels,TempCharacter1]&&MemberQ[Consonants,TempCharacter2],
            RawSyllableCount=RawSyllableCount+1
          ],
          {LetterIndex,1,NCharacters-1}];
        CurrentSyllableCount=RawSyllableCount;
        (* One more for each IA and each EO. *)
        CurrentSyllableCount=CurrentSyllableCount+Length[StringPosition[WordArg,"IA"]];
        CurrentSyllableCount=CurrentSyllableCount+Length[StringPosition[WordArg,"EO"]];
        (* One more for an -IOUS or -IER ending. *)
        If[
          NCharacters>4&&StringTake[WordArg,-4]=="IOUS",
          CurrentSyllableCount=CurrentSyllableCount+1
        ];
        If[
          NCharacters>3&&StringTake[WordArg,-3]=="IER",
          CurrentSyllableCount=CurrentSyllableCount+1
        ];
        (* Correction for (consonant)ED endings. *)
        If[
          NCharacters>2&&CharacterList[[-1]]=="D"&&CharacterList[[-2]]=="E"&&MemberQ[Consonants,CharacterList[[-3]]],
          EDCorrection=-1;
          If[
            MemberQ[{"D","T"},CharacterList[[-3]]],
            EDCorrection=0,
            If[
              ((NCharacters>3)&&CharacterList[[-3]]=="R"&&CharacterList[[-4]]!="R")||((NCharacters>3)&&CharacterList[[-3]]=="L"&&MemberQ[Consonants,CharacterList[[-4]]]),
              EDCorrection=0
            ]
          ],
          EDCorrection=0
        ];
        CurrentSyllableCount=CurrentSyllableCount+EDCorrection;
        (* Correction for (consonant)ES endings. *)
        If[
          NCharacters>2&&CharacterList[[-1]]=="S"&&CharacterList[[-2]]=="E"&&MemberQ[Consonants,CharacterList[[-3]]],
          ESCorrection=-1;
          If[
            MemberQ[{"C","G","X","S","Z"},CharacterList[[-3]]],
            ESCorrection=0,
            If[
              ((NCharacters>3)&&CharacterList[[-3]]=="L"&&MemberQ[Consonants,CharacterList[[-4]]]),
              ESCorrection=0
            ]
          ],
          ESCorrection=0
        ];
        CurrentSyllableCount=CurrentSyllableCount+ESCorrection;
        (* Final syllables: the run of vowels after the last consonant. *)
        TempPosition=LastConsonantPosition[CharacterList];
        If[
          (TempPosition==NCharacters)||((TempPosition==NCharacters-1)&&(CharacterList[[-1]]=="E")),
          FinalSyllableCount=0,
          VowelChunk=Take[CharacterList,{TempPosition+1,NCharacters}];
          FinalSyllableCount=
            If[
              MemberQ[DisyllabicVowelPairs,VowelChunk],
              2,
              1
            ]
        ];
        CurrentSyllableCount=CurrentSyllableCount+FinalSyllableCount;
        (* Never report fewer than one syllable. *)
        Max[CurrentSyllableCount,1]
      ]
    }[[1]];
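For comparison, the ED branch above can be flattened into a single `Which`. This is only a sketch, and `edVowels`, `consonantQ` and `edCorrection` are names invented here rather than anything in the code above. It follows the test the code actually makes (an R not preceded by another R), which is slightly looser than the "(consonant other than R)+R" wording in the description.

```mathematica
(* Illustrative names only; not part of the original code. *)
edVowels = {"A", "E", "I", "O", "U", "Y"};
consonantQ[ch_String] := ! MemberQ[edVowels, ch];

(* Returns -1 when a (consonant)ED ending should lose its syllable, else 0. *)
edCorrection[word_String] := Module[{c = Characters[word], n},
  n = Length[c];
  Which[
    (* Not a (consonant)ED ending at all: no correction. *)
    n < 3 || c[[-2 ;;]] =!= {"E", "D"} || ! consonantQ[c[[-3]]], 0,
    (* DED and TED keep their syllable. *)
    MemberQ[{"D", "T"}, c[[-3]]], 0,
    (* RED keeps it unless the R is doubled. *)
    n > 3 && c[[-3]] == "R" && c[[-4]] != "R", 0,
    (* (consonant)LED keeps it. *)
    n > 3 && c[[-3]] == "L" && consonantQ[c[[-4]]], 0,
    (* Everything else: deduct one syllable. *)
    True, -1
  ]
];

edCorrection /@ {"SAILED", "TOTED", "INBRED", "WARRED", "PETTLED"}
(* {-1, 0, 0, -1, 0} *)
```

Short-circuit evaluation of `||` and `&&` keeps the negative indices from being reached on words that are too short, as the `NCharacters>2` and `NCharacters>3` guards do in the code above.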
