How to link lives

For a research project like Link-Lives, the central question is how to actually link lives.

At first glance, it does not seem too difficult to link information from different sources about one individual, but doing it on a large scale is something else.

In research, we use vital personal information such as first name, surname, date of birth and place of birth to link historical individuals. If two sets of vital personal information in two different sources match, we consider it a link. We can also include additional personal information, such as occupation and residence, to create links. However, if we use this information systematically, the created links can become biased, because we then only capture the people who never moved or changed jobs. The result would be a database of people who lived primarily stable lives, which would not be a true representation of the Danish population in the 1800s.
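
The bias described above can be illustrated with a minimal Python sketch. The record layout, the field names and the example person are assumptions made up for illustration; they are not the Link-Lives data model.

```python
from typing import NamedTuple


class Record(NamedTuple):
    # Hypothetical record layout, chosen only for this illustration.
    first_name: str
    surname: str
    birth_year: int
    birth_place: str
    residence: str


def core_key(rec: Record) -> tuple:
    """The vital personal information used for linking."""
    return (rec.first_name, rec.surname, rec.birth_year, rec.birth_place)


def is_link(a: Record, b: Record, use_residence: bool = False) -> bool:
    """Exact match on vital information, optionally also on residence."""
    if core_key(a) != core_key(b):
        return False
    return a.residence == b.residence if use_residence else True


# Invented example: the same woman in two sources, before and after a move.
earlier = Record("Maria", "Jensen", 1850, "Odense", "Odense")
later = Record("Maria", "Jensen", 1850, "Odense", "Copenhagen")

linked_on_vitals = is_link(earlier, later)
linked_with_residence = is_link(earlier, later, use_residence=True)
```

In this sketch the two records link on vital information alone, but the link is lost as soon as residence is required to match. A person who moved drops out of the data, which is exactly how the bias towards stable lives arises.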

One of the key issues in linking historical individuals is the quality of the source material, since it is not a given that each source contains the same type of personal information. For example, it is easier to create links from 1901 onwards because from that year the enumerators recorded the full date of birth instead of only an age, as in previous censuses. Similarly, it is more difficult to link individuals before 1845 because place of birth was not recorded in these early censuses. A possible way around the latter issue is to incorporate family linkage (1). This means that we take into account each individual’s relations to other people in the household, whether these are children, parents or a spouse. For example, there may be many potential matches for a Hans Christensen born in Copenhagen in 1856, but when we know that in 1880 he was married to Anna, born in Aarhus, and had three children, we can use that information to see whether any of the potential Hans Christensens was living with any of them in 1885.
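
The Hans Christensen example can be sketched in a few lines of Python. This is only an illustration of the idea of family linkage, not the method of the cited paper: households are reduced to sets of (first name, birth place) pairs, the children's names and the candidate identifiers are invented, and the rule "keep the single candidate with the largest household overlap" is an assumption.

```python
def household_overlap(known_household: set, candidate_household: set) -> int:
    """Count household members that appear in both households."""
    return len(known_household & candidate_household)


def pick_by_family(known_household: set, candidates: dict):
    """Return the id of the one candidate whose household overlaps most.

    Returns None when no candidate shares a household member, or when
    several candidates tie, because the family then does not disambiguate.
    """
    scores = {
        cand_id: household_overlap(known_household, members)
        for cand_id, members in candidates.items()
    }
    best = max(scores.values(), default=0)
    winners = [cand_id for cand_id, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


# Household of Hans Christensen in 1880 (children's names are invented).
household_1880 = {
    ("Anna", "Aarhus"),
    ("Peter", "Copenhagen"),
    ("Marie", "Copenhagen"),
    ("Jens", "Copenhagen"),
}

# Households of three hypothetical Hans Christensens found in 1885.
candidates_1885 = {
    "hans_1": {("Anna", "Aarhus"), ("Peter", "Copenhagen"), ("Jens", "Copenhagen")},
    "hans_2": {("Karen", "Odense")},
    "hans_3": set(),
}

best_candidate = pick_by_family(household_1880, candidates_1885)
```

Here only the first candidate lives with people known from the 1880 household, so he is the one selected. A real implementation would also have to tolerate spelling variation in the family members' names and changes in the household over time.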

Three different approaches

There are three different approaches to creating links across source material within historical demographic research: linking by a trained historian, rule-based linking by computer, and machine learning.

Trained historians or other skilled academics typically link by looking up every person (or nearly every person) who appears in one source in the second source. They can immediately see and confirm links between two sets of information, which gives the resulting links a high standard of quality. The downside is that it is a very time-consuming approach, which can only be applied to a small sample of one hundred or perhaps a few thousand individuals. It also runs the risk of involving a degree of subjective choice, which makes it difficult for other researchers to repeat the same linking process. Most researchers will agree on the majority of the links, but unless very explicit rules are used, different researchers may give weight to different information in unclear cases: one may prefer a candidate with the same birth place but a slightly different name, while another may prefer a candidate with an exactly matching name living in a neighbouring parish.

Creating links through a rule-based computer approach is a lot faster than having trained historians do it. This way of creating links is systematic and transparent, which makes it possible for others to repeat and check the resulting links. It involves comparing a given person in one source with all the possible candidates in a second source and defining how similar they have to be for a given match to be considered a link. However, not everyone has to be compared to everyone: Maria Jensen born in Copenhagen should not be compared to all people in Denmark but only to women born in Copenhagen. The comparison is made by calculating how different the names are: for example, Hanna and Anna differ by only one character, and Kristiansen and Christiansen by two. We can also easily compute the number of years between two birth years or ages. The challenge is how to combine these possible differences into rules that capture only the real matches. The downside is that the linkage rate is limited to the relatively easy cases, where the degree of difference between two records is not very large. If two records belong to the same person but differ by more than the rules accept, the possible match is left unlinked.

The third option is to use machine learning to identify which pairs of records are a link (2). Machine learning requires an initial set of training data consisting of true and false links, which has been created or verified by trained historians. Like the rule-based approach, the program compares the differences between records, but there are no explicitly defined rules: the computer derives from the training data what historians consider a link and what they do not. Having learned this, the machine learning algorithm can use the comparison between records to predict which pairs of records are actual links.
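
To make the idea concrete, here is a minimal sketch that learns from labelled pairs instead of using hand-written rules. It uses a small logistic regression written in plain Python as a stand-in; the source does not say which algorithm Link-Lives or the cited paper uses. The two comparison features (name distance and birth-year gap) and the tiny training set are invented for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.1, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            error = sigmoid(z) - y
            for k, xi in enumerate(x):
                grad_w[k] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def link_probability(model, x) -> float:
    """Predicted probability that a pair of records is a link."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Invented training pairs: [name distance, birth-year gap], with a label
# of 1 for pairs a historian accepted as links and 0 for rejected pairs.
train_x = [[0, 0], [1, 0], [0, 1], [1, 1],
           [4, 0], [3, 5], [5, 2], [2, 6], [6, 6]]
train_y = [1, 1, 1, 1, 0, 0, 0, 0, 0]

model = train_logistic(train_x, train_y)

p_identical = link_probability(model, [0, 0])   # records that agree fully
p_different = link_probability(model, [6, 6])   # records that differ a lot
```

No threshold on name distance or year gap is written down anywhere in this code. The model infers from the labelled pairs that small differences indicate a link, so pairs that agree closely receive a high probability and very different pairs a low one.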

Both explicit rules and machine learning techniques can quickly predict which pairs are links and which are not for millions of records. However, both will miss some true links and assign some links that are not actual links, which is what we call false negatives and false positives, respectively. To minimize these, Link-Lives complements the automatic assignment with manual validation (performed by historians), which helps to calibrate the programs so that they create increasingly reliable links.

Link-Lives will test and use all three approaches in the linking of Danish censuses, parish registers and burial records. To begin with, Link-Lives will focus on creating the easy links and then move on to tackling the more difficult ones.
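
The false positives and false negatives mentioned above can be counted by comparing automatically created links with a manually validated set. The following Python sketch assumes that links are represented as pairs of record identifiers; the identifiers are invented, and precision and recall are standard summary measures rather than something the source names.

```python
def evaluate_links(predicted, validated) -> dict:
    """Compare automatic links with links validated by historians."""
    predicted, validated = set(predicted), set(validated)
    true_pos = predicted & validated    # links found and confirmed
    false_pos = predicted - validated   # links assigned but not real
    false_neg = validated - predicted   # real links that were missed
    return {
        "true_positives": len(true_pos),
        "false_positives": len(false_pos),
        "false_negatives": len(false_neg),
        # Share of assigned links that are correct.
        "precision": len(true_pos) / len(predicted) if predicted else 0.0,
        # Share of real links that were found.
        "recall": len(true_pos) / len(validated) if validated else 0.0,
    }


# Invented example: three automatic links checked against three validated ones.
predicted_links = {("a1", "b1"), ("a2", "b7"), ("a3", "b3")}
validated_links = {("a1", "b1"), ("a3", "b3"), ("a4", "b4")}

report = evaluate_links(predicted_links, validated_links)
```

In this example two links are confirmed, one assigned link is wrong (a false positive) and one real link was missed (a false negative). Tracking these counts after each round of manual validation is one way to see whether a change to the rules or the training data has made the links more reliable.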

References:
(1) Özgür Akgün, Alan Dearle, Graham Kirby, Eilidh Garrett, Tom Dalton, Peter Christen, Chris Dibben & Lee Williamson, “Linking Scottish vital event records using family groups” in Historical Methods: A Journal of Quantitative and Interdisciplinary History (March 2019).

(2) Ron Goeken, Lap Huynh, T. A. Lynch & Rebecca Vick, “New Methods of Census Record Linking” in Historical Methods: A Journal of Quantitative and Interdisciplinary History (44:1, 2011).