A Daitch-Mokotoff Soundex Function for R
To cite it: Perdoncin Anton and Mercklé Pierre, “A Daitch-Mokotoff Soundex for R”, 2021, Lubartworld ERC Project, https://lubartworld.cnrs.fr/en/daitch-mokotoff-soundex-for-r.
Aidelman, Ajdelman, Edelman, Ejdelman; Morgenstern, Morgensztern, Morgiensztern; Raizl, Rachel, Ruchla, Rajzla, Rechla; Leibush, Lejbus, Lejbusz: within each of these four lists, the patronyms or first names sound the same but are not spelled identically. How can the phonetic correspondence between orthographic variants of the same name be detected automatically?
Word cloud of Lubartowian family names recorded in the Lubartów register of population
This generic linguistic problem arises anew in research projects that aim to follow and reconstruct individual and collective trajectories and biographies, relying on multiple sources and corpora of observations gathered from archives (such as administrative records, personal files, images, censuses, lists or registers). Major methodological issues arise around the need to identify individuals who are recorded several times in a single source, or in different sources. We call “duplicates” these separate records corresponding to the same person. Duplicates can occur for many reasons, most often when observations from distinct archival or documentary sources are compared and assembled.
Duplicates can also appear within a single source or register, when it is likely to record the same individuals several times in succession. Whether the aim is to match individuals with themselves across sources or within a source, the problem remains the same: how can one know that a given individual is duplicated? According to which criteria can the researcher identify individuals?
Questions Raised by the Lubartworld Project
The ERC Lubartworld research project is concerned with both issues: on the one hand, the Register of population opened in 1932 by the local authorities is dynamic and may record several times individuals who moved within, into or out of the town; on the other hand, the whole research project is about tracking Lubartowians across borders, in a great variety of sources worldwide. It may seem obvious to consider duplicates as impurities in the data, which should be systematically removed. On the contrary, identifying duplicates is of the utmost importance for two main reasons: first, in some statistical analyses, counting the same individual several times may lead to misinterpretations; and second, duplicates are essential to document trajectories, careers or migrations. The ultimate objective is therefore to keep them and, in the end, to match them [1]. Our case is therefore the following: we seek to “follow” the inhabitants of Lubartów within a register of population, but also in other sources, in order to analyse their migration, persecution and socio-professional trajectories.
Whether the aim is to remove them temporarily from the dataset or to keep and use them for analysis, the question remains: how can one detect duplicated records in a database? Amongst many other criteria [2], family names and first names can be used to find duplicates. But names are by no means a straightforward and unambiguous piece of information. The risk of homonymy is not the only challenge: names can change, and not only as a personal choice, or because women marry and take their husband’s surname (Lapierre 1995; Bouquet et al. 2013). Misspellings, phonetic transcriptions or changes of alphabet can prevent the researcher from identifying individuals. This becomes more frequent when the phenomena studied are located in multilingual societies or states, and when individuals migrate and are recorded in sources produced in different national and linguistic contexts. Dates of birth can also change, due to recording errors, uncertainties surrounding civil registries, inconsistent or strategic declarations by individuals, etc.
In the case of the Lubartworld project, serious doubts can be raised about the consistency of the registration of surnames and first names. It is indeed frequent that these surnames and first names are transcribed with different spellings, in the same or in different sources, for the same individual, or even for members of the same family. The problem is therefore not whether Jean Perrin is Jean Perrin, but whether Pierre Mercklé is Pierre Merckel (as it is frequently misspelled), and Anton Perdoncin is Antoine Perdonin… Or, to cite examples from the Lubartów registers rather than our own names: is Josef-Hersz Honigsblum the same person as Josef-Hersz Honiksblum? And is Gitla Akiersztajn the same person as Chaja-Gitla Akiersztejn?
The Daitch-Mokotoff Soundex
In order to figure out whether individual records with slightly different surnames and/or first names refer to the same person, we need a tool that estimates the “distance” between the pronunciations of these different names. This kind of tool is called a “Soundex” algorithm, i.e. a phonetic algorithm that indexes names by sound, thus making it possible to match words that are not written identically but sound the same.
The principle of such an algorithm is fairly easy to grasp: it usually removes vowels (unless the vowel is the first letter) and then encodes homophone consonants with the same index (usually a number). The generic and most commonly used Soundex algorithm is designed for English pronunciation. Therefore, to identify duplicated records in an Eastern European context we need a specific Soundex algorithm, named after its two inventors: the Daitch-Mokotoff Soundex System (DM Soundex). It is a variant of the classical Soundex phonetic algorithm, adapted to handle Eastern European patronymics. It was first developed in 1985 by Gary Mokotoff, a computer science engineer involved in Jewish genealogy, in an attempt to index the names of 28,000 persons who changed names in Palestine from 1921 to 1948: “Using the conventional U.S. government system, which is based on the Russell system, many Eastern European Jewish names which sound the same did not soundex the same. The most prevalent were those names spelled interchangeably with the letter w or v, for example, the names Moskowitz and Moskovitz” (Mokotoff 1997) [3]. The scheme was expanded in 1986 by another Jewish genealogist, Randy Daitch, and then released.
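To make the contrast concrete, here is a minimal sketch in R of the classical American Soundex (first letter followed by three digits). The function name classic_soundex() is ours and purely illustrative; it is not the function of any package discussed in this article. It shows why the w/v alternation quoted above is a problem: in the classical scheme W is skipped whereas V is coded 1, so the two spellings receive different codes.

```r
# Illustrative sketch of the classical (American) Soundex, not the DM variant.
classic_soundex <- function(name) {
  # Consonant classes of the classical scheme
  codes <- c(B = "1", F = "1", P = "1", V = "1",
             C = "2", G = "2", J = "2", K = "2", Q = "2", S = "2", X = "2", Z = "2",
             D = "3", T = "3", L = "4", M = "5", N = "5", R = "6")
  x <- strsplit(toupper(gsub("[^A-Za-z]", "", name)), "")[[1]]
  # H and W are transparent after the first letter: drop them
  x <- x[c(TRUE, !(x[-1] %in% c("H", "W")))]
  # Map letters to digits; vowels (and Y) get the placeholder "0"
  digit <- unname(codes[x])
  digit[is.na(digit)] <- "0"
  # Collapse runs of identical digits, then discard the first letter's own run
  runs <- rle(digit)$values[-1]
  runs <- runs[runs != "0"]
  # First letter + three digits, padded with zeros
  paste(c(x[1], runs, "0", "0", "0")[1:4], collapse = "")
}

classic_soundex("Moskowitz")  # "M232"
classic_soundex("Moskovitz")  # "M213"
```

The two spellings of the same name end up in different index classes, which is precisely the behaviour the Daitch-Mokotoff system was designed to avoid.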
The Daitch-Mokotoff Soundex System is based on seven main rules:
1. Names are coded to six digits, each digit representing a sound, according to the Daitch-Mokotoff Coding Chart below.
2. Vowels and the letter J are ignored, except at the beginning of the name or when two of them form a pair and the pair comes before a vowel, as in Breuer (791900). Likewise, the letter H is coded at the beginning of a name or when preceding a vowel, otherwise it is not coded.
3. Adjacent letters which combine to form a larger sound are given the code number of the larger sound, as in Berkowitz, which is coded Berkowi-tz (795740) and not coded Berkowi-t-z (795734).
4. Adjacent letters with the same code number are coded as one single sound. Exceptions to this rule are the combinations MN and NM, whose letters are coded separately, as in Kleinman which is coded 586660 not 586600.
5. Names consisting of more than one word are coded as one single word, after removing separating hyphens and spaces.
6. Several letters and letter combinations pose the problem that they may sound in different ways. The letters and letter combinations CH, CK, C, J and RZ (see chart below) are assigned two possible code numbers, thus resulting in some names having multiple soundexes instead of a single one.
7. When the letters of a name are fully coded in fewer than six digits, the remaining digits are coded 0, as in Berlin (798600), which has only four coded sounds (B-R-L-N).
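Rules 3 and 7 can be illustrated with a deliberately tiny sketch. The chart below contains only the handful of codes that can be read off the examples above (Berkowitz and Berlin); it is not the Daitch-Mokotoff Coding Chart, and the names toy_chart and toy_dm() are hypothetical. The sketch ignores initial vowels, rule 4 and the branching of rule 6.

```r
# Toy subset of codes, taken from the Berkowitz and Berlin examples only.
toy_chart <- c(TZ = "4", B = "7", K = "5", L = "8", N = "6",
               R = "9", T = "3", W = "7", Z = "4")

toy_dm <- function(name, chart = toy_chart) {
  s <- toupper(name)
  n <- nchar(s)
  out <- character(0)
  i <- 1
  while (i <= n) {
    two <- if (i < n) substr(s, i, i + 1) else ""
    one <- substr(s, i, i)
    if (two %in% names(chart)) {
      # Rule 3: a letter pair forming a larger sound gets one code
      out <- c(out, chart[[two]])
      i <- i + 2
    } else if (one %in% names(chart)) {
      out <- c(out, chart[[one]])
      i <- i + 1
    } else {
      # Letters absent from the toy chart (here: vowels) are not coded
      i <- i + 1
    }
  }
  # Rule 7: pad with zeros (or cut) to six digits
  substr(paste0(paste(out, collapse = ""), "000000"), 1, 6)
}

toy_dm("Berkowitz")  # "795740"
toy_dm("Berlin")     # "798600"
# Without the TZ pair, T and Z are coded separately (the rejected reading)
toy_dm("Berkowitz", chart = toy_chart[names(toy_chart) != "TZ"])  # "795734"
```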
Implementation of the Daitch-Mokotoff Soundex algorithm in R
The Daitch-Mokotoff Soundex System has become the standard of most Jewish genealogical indexes. For instance, it is implemented in JewishGen’s Holocaust Database search instrument, and in Yad Vashem’s Central Database of Shoah Victims’ Names. In these cases, the algorithm runs in the back end, enabling anyone looking for a person to access a variety of records containing names that “sound like” this person’s name. But if one wants to actually calculate DM Soundex codes, it becomes more complicated. The JOS Soundex Calculator on JewishGen only allows calculations for single words. Steve Morse’s website offers a very handy tool to calculate DM Soundex (in its Beider-Morse version), but on a copy/paste basis that does not allow automatic and supervised replication of results.
This is why, in order to facilitate the identification of duplicates in the Lubartworld project and the integration of results in a data management and processing workflow, we needed to integrate the DM Soundex algorithm in R. Before coding our own function, we looked at what already existed. There are at least three alternative Soundex implementations in R, but none of them, and no R function that we know of, implements the Daitch-Mokotoff variant:
1. a phonetic() function is implemented in the stringdist package;
2. a soundex() function is implemented in the RecordLinkage package;
3. and a function also called soundex() is implemented in the phonics package.
Rather than simply “translating” the Apache or Python code implemented in the online tools mentioned above, we have opted for an alternative approach, hopefully simpler and faster. Daitch-Mokotoff specifications are far more complex than the generic Soundex algorithm, because of n-gram coding and subsequent branching, which can generate different soundexes for the same name [4]. The iterative approach implemented by the available scripts mentioned above, which starts from the first n-gram and proceeds to the end of the word, appears to be very time-consuming. Coding in R allows us to take advantage of the vectorized logic of this programming language. The result is a soundex() function that operates on words considered as a whole and codes all submitted names at once, as follows:
1. Names are cleaned to remove accents, non-alphabetic characters, hyphens and spaces.
2. In this order, beginnings, ends, sounds before a vowel and finally all remaining letters are coded according to a Coding Chart that is adapted from the original one to include a few ambiguous cases.
3. Names with one or more letters or groups of letters that can be indexed in two different ways (see rule 6 above) are coded in as many soundexes as there are possible combinations of their codes.
4. In the resulting combinations, identical adjacent numbers are replaced by one single number (see rule 4 above).
5. Soundexes are cut or extended (with trailing zeros) to six digits.
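Steps 1, 3, 4 and 5 can be sketched in a few lines of base R. This is not the code of our soundex() function: the helper names clean_names() and expand_codes() are hypothetical, accent removal is left out, and the coding step itself (step 2) is replaced by a hand-written list of codes. The "_" placeholder for uncoded sounds is also an assumption of the sketch; it keeps digits separated by a vowel from being merged, as in Cukierberg (559795), where the two 5s stay distinct.

```r
# Step 1 (partial): uppercase, then drop hyphens, spaces and any other
# non-alphabetic character, for all names at once
clean_names <- function(x) gsub("[^A-Z]", "", toupper(x))

# Steps 3 to 5 for one name. `alternatives` holds, for each sound, its
# possible code(s); "_" marks a sound that is not coded (e.g. a vowel).
expand_codes <- function(alternatives) {
  # Step 3: one row per combination of the alternative codes
  grid <- expand.grid(alternatives, stringsAsFactors = FALSE)
  raw <- do.call(paste0, unname(as.list(grid)))
  # Step 4: identical adjacent digits become a single digit
  collapsed <- gsub("([0-9])\\1+", "\\1", raw)
  collapsed <- gsub("_", "", collapsed, fixed = TRUE)
  # Step 5: cut or pad with zeros to six digits
  unique(substr(paste0(collapsed, "000000"), 1, 6))
}

clean_names(c("Josef-Hersz", "Chaja Gitla"))  # "JOSEFHERSZ" "CHAJAGITLA"

# C-U-K-IE-R-B-E-R-G, with C read either as K (5) or as TZ (4)
cukierberg <- list(c("5", "4"), "_", "5", "_", "9", "7", "_", "9", "5")
expand_codes(cukierberg)  # "559795" "459795"
```

Because gsub(), paste0() and substr() are vectorized, the collapsing and padding steps apply to every combination in a single call, which is the property the vectorized approach relies on.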
Testing the R soundex function on different samples of Eastern European names
Since we coded the Daitch-Mokotoff algorithm in a different manner than Steve Morse on his very handy webpage (in a vectorized rather than iterative way [5]), we need to check whether it still returns accurate soundexes. To this end, we ran our R function on three different samples of Eastern European names:
1. our own sample from the Lubartów registers: first names and surnames of individuals who lived in this small Eastern Polish town in the 1930s and 1940s;
2. a large dataset of contemporary family names from all over the world;
3. a large dataset of contemporary Polish surnames.
Testing the function on the Lubartów Register
The 11,950 records of the Lubartów 1932 Register correspond to 2,311 distinct surnames and 3,613 distinct first names. After removing hyphens and spaces, we get 2,300 surnames and 803 first names. The soundex() R function was successfully tested on this combined list of 3,103 first names and surnames. The soundexes for this list of names were first generated with Steve Morse’s online calculator. Among these 3,103 names, 2,236 (72.1%) generate unique soundex codes, and 867 (27.9%) generate multiple soundex codes, as they contain letters or letter combinations (such as CH, CK, C, J and RZ) that may sound in one of two ways (see rule 6 and Coding Chart above). Then we compared these soundexes with those computed by our function, and obtained a perfect match: overall, 100% of our soundex codes and code combinations are identical to those generated with Morse’s online calculator. It is worth noting that, to obtain results that perfectly match the soundexes generated by Morse’s online calculator, a few adaptations have been made to the rules presented above: RS is supposed to be coded as 94 or 4 in the original rules, but Morse codes it as 94, and so do we [6].
Testing the function on a big dataset of world family names
This first test shows that our soundex() function is correct for the Lubartów register, but it doesn’t prove that it returns a 100% correct soundex for each and every possible name. Since the function is not only meant for our own use, but will be released in a package for wider use, we have to test its accuracy on a much larger set of names. For this second test we use the very large database of surnames from all over the world gathered by Philippe Rémy from a wide range of sources, and made available on his GitHub repository. This dataset comprises 97,147 surnames, and the test of our soundex() function returns the following results: overall, 99.78% of soundex codes are correctly computed by the function, and in only 0.19% of cases do the soundex or soundexes computed by our function fail to match those computed by Steve Morse’s algorithm (see Table above).
Let’s examine one of those “all wrong” matches, BLASENHAUER: Steve Morse codes it 784657, as does JewishGen, but our function codes it 784679. This means that our function doesn’t code the H before AU as an H before a vowel. Such small differences generate a very limited proportion of wrong matches, and can be disregarded as they do not result in systematic errors or divergences between our procedure and Steve Morse’s algorithm (see list of wrong matches in Table above).
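The kind of comparison reported here can be sketched as follows. The helper compare_soundexes() is hypothetical and is not the script we actually ran; it assumes that both sets of results are stored as lists with one character vector of codes per name, and the three-name sample only reuses codes quoted in this article.

```r
# Classify each name by how our codes relate to the reference codes,
# then return the share of each category in percent.
compare_soundexes <- function(ours, reference) {
  status <- mapply(function(a, b) {
    if (setequal(a, b)) {
      "all match"
    } else if (length(intersect(a, b)) > 0) {
      "partial match"
    } else {
      "no match"
    }
  }, ours, reference)
  categories <- c("all match", "partial match", "no match")
  sapply(categories, function(k) round(100 * mean(status == k), 2))
}

# Illustrative sample: the order of multiple codes does not matter
ours <- list(BLASENHAUER = "784679",
             CUKIERBERG  = c("559795", "459795"),
             BERLIN      = "798600")
reference <- list("784657", c("459795", "559795"), "798600")

compare_soundexes(ours, reference)
```

Using setequal() rather than identical() means that a name with several soundexes counts as a full match whenever both sides produce the same set of codes, whatever their order.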
Testing the function on a big dataset of contemporary Polish surnames
In order to assess the robustness of our function, we lastly tested it on a big dataset of contemporary Polish surnames (available on the Polish government open data website). Overall, 99.78% of soundex codes or code combinations (in case of multiple soundexes) are perfectly computed by the function, and in only 0.19% of cases, none of the soundexes computed by our function match those computed by Steve Morse’s algorithm. In these very few cases, differences can again be explained by the order in which rules are applied.
How to use the soundex function?
The code (.Rmd) of this article and the data needed to replicate tests can be downloaded below. It comprises the code of the function.
The function is also available in the new R package datatools, which can be downloaded from GitHub: https://github.com/pmerckle/datatools. Updates and modifications will be implemented in the package, but may not be added to the source code of this article: we therefore strongly recommend using the datatools package. The datatools package is still under construction, but it is fully operational.
The soundex() R function that computes Daitch-Mokotoff soundexes is perfectly accurate on the Lubartów Register, but returns soundexes that may differ from those computed by Morse’s algorithm in a very limited number of cases (approximately 0.2%). Though it is likely that Morse’s version is more faithful to the original Daitch-Mokotoff Coding Chart than our R function, we consider that the latter is nonetheless an acceptable implementation of the Daitch-Mokotoff Soundex algorithm in R. It is relatively fast (by R standards at least): the function codes more than 2,400 names per second when run on an average-performance laptop computer, and up to XXX names per second on a high-performance desktop computer. Errors are very rare and mainly concentrated on names that are not of Eastern European origin. And, above all, these errors are likely to be consistent, which means that they will recur identically when the algorithm is used on different databases. The fact that our function occasionally returns results that differ from Morse’s algorithm will therefore not impair its ability to detect identically sounding names in databases.
This function opens up new research perspectives for the Lubartworld project, as it enables us to compare, and ultimately match, names gathered in different sources. For instance, an ongoing search for Lubartowians in genealogy databases (e.g. Ancestry.com) provides us with lists of persons that should be traced back in our Polish source. Zipora Rotsztejn (coded 943600), found on Ancestry.com, would then have to be compared to all individuals whose names are coded the same in the Register: Rojtsztajn, Rotsztajn and Rotsztejn. Work in progress…
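As a rough sketch of this matching step, assume the soundexes have already been computed and stored next to each name. The data frame layout, the column names and the helper find_candidates() are hypothetical; the codes are those quoted in this article (943600 for the Rotsztejn variants, 798600 for Berlin).

```r
# Hypothetical extract of the register, one row per (name, soundex) pair;
# a name with several soundexes would simply appear on several rows.
register <- data.frame(
  name = c("Rojtsztajn", "Rotsztajn", "Rotsztejn", "Berlin"),
  code = c("943600", "943600", "943600", "798600"),
  stringsAsFactors = FALSE
)

# Return the register names sharing at least one soundex with the query
find_candidates <- function(query_codes, register) {
  unique(register$name[register$code %in% query_codes])
}

# Zipora Rotsztejn, found in an external source, is coded 943600
find_candidates("943600", register)
# "Rojtsztajn" "Rotsztajn" "Rotsztejn"
```

The candidates returned this way are only phonetic matches on the surname; other criteria, such as first names or dates of birth, would still be needed to decide whether two records refer to the same person.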
[1] For instance, in their research on academic careers, Olivier Godechot and Alexandra Louvet used the DOCTHESE database of PhD dissertations defended in France to track researchers who first appear in the database as PhD candidates and then reappear a few years later as PhD supervisors (Godechot and Louvet 2010). In this research the date of birth is not available. The question then asked by Godechot and Louvet is: “Is Jean Perrin Jean Perrin?”. In other words, is the researcher who appears with this name as thesis director in the database from 1979 onward the same as the one who bears this name and defended a PhD dissertation in Grenoble seven years earlier, in 1972?
[2] Family names, first names, dates of birth, social security or diverse identification numbers, etc. All these modes of identification have a history and should never be taken for granted. Discussion of this aspect of the problem is beyond the scope of this article. Another article will be published in order to address the question of the identification of individuals in historical sources.
[3] This work contributed to the computerization of the National Registry of Jewish Holocaust Survivors, now located in the United States Holocaust Memorial Museum (USHMM).
[4] As stated in rule 6 of the DM Soundex System, single letters and groups of letters can be interpreted differently phonetically, and therefore produce different codes. For instance, as explained in the Daitch-Mokotoff Coding Chart above, the unigram “C” can be interpreted as the unigram “K” (therefore producing the code 5) or as the bigram “TZ” (therefore producing the code 4). This complex procedure results in branching, meaning that a name can correspond to more than one Soundex code. For instance, “Cukierberg” can be coded “559795” (as “Kukierberg”) or “459795” (as “Zukierberg”).
[5] This means that we treated character strings as a whole: the function replaces all n-grams by codes at once, and not iteratively (starting from the beginning and proceeding to the end).
[6] For an explanation of the Beider-Morse procedures, see https://stevemorse.org/phonetics/bmpm.htm and https://stevemorse.org/phonetics/bmpm2.htm.