Why tidyged?

One of the main characteristics I wanted for this package was to hide the complexity of the GEDCOM specification. I began with what I wanted the user interface to look like and worked backwards from there. I felt that adopting a tidyverse feel to the package would enhance readability and would significantly flatten the learning curve for those who are familiar with ggplot2.

Data structures

I spent a significant amount of time before writing any code considering my options for how the data would be stored under the hood. In one blog post I considered storing genealogical data in a relational table format as it is easier to deal with, but discounted it very quickly as it is not well suited to nested data (list columns are not ideal for this application).

I toyed with the idea of using an off-the-shelf open source product like GRAMPS but I found it awkward to use and wanted something where I was in complete control, taking full advantage of the strengths of R.

I also considered using data structures more suited to this type of data, such as JSON or graphs (using the igraph or data.tree package). However, I discovered it would be quite difficult to represent some of the structures in the GEDCOM specification to my satisfaction.

It was some time before a better solution occurred to me, and that was to store the GEDCOM file almost as is, and just to split out the components of each line into their own columns, creating a tidy version of a GEDCOM file (hence the name tidyged). This would allow easy manipulation using existing tidyverse infrastructure, although some additional processing would be needed to deal with the ordered and nested nature of the file. Ironically, I had returned to the idea of a tidy dataframe that I had dismissed early on.

The tidyged object

The tidyged object is a tibble representation of a GEDCOM file. Before we see how this is structured, let’s see an example of a GEDCOM file:

readLines(system.file("extdata", "555SAMPLE.GED", package = "tidyged.io"))
#>  [1] "0 HEAD"                                                                                 
#>  [2] "1 GEDC"                                                                                 
#>  [3] "2 VERS 5.5.5"                                                                           
#>  [4] "2 FORM LINEAGE-LINKED"                                                                  
#>  [5] "3 VERS 5.5.5"                                                                           
#>  [6] "1 CHAR UTF-8"                                                                           
#>  [7] "1 SOUR GS"                                                                              
#>  [8] "2 NAME GEDCOM Specification"                                                            
#>  [9] "2 VERS 5.5.5"                                                                           
#> [10] "2 CORP gedcom.org"                                                                      
#> [11] "3 ADDR"                                                                                 
#> [12] "4 CITY LEIDEN"                                                                          
#> [13] "3 WWW www.gedcom.org"                                                                   
#> [14] "1 DATE 2 Oct 2019"                                                                      
#> [15] "2 TIME 0:00:00"                                                                         
#> [16] "1 FILE 555Sample.ged"                                                                   
#> [17] "1 LANG English"                                                                         
#> [18] "1 SUBM @U1@"                                                                            
#> [19] "0 @U1@ SUBM"                                                                            
#> [20] "1 NAME Reldon Poulson"                                                                  
#> [21] "1 ADDR "                                                                                
#> [22] "2 ADR1 1900 43rd Street West"                                                           
#> [23] "2 CITY Billings"                                                                        
#> [24] "2 STAE Montana"                                                                         
#> [25] "2 POST 68051"                                                                           
#> [26] "2 CTRY United States of America"                                                        
#> [27] "1 PHON +1 (406) 555-1232"                                                               
#> [28] "0 @I1@ INDI"                                                                            
#> [29] "1 NAME Robert Eugene /Williams/"                                                        
#> [30] "2 SURN Williams"                                                                        
#> [31] "2 GIVN Robert Eugene"                                                                   
#> [32] "1 SEX M"                                                                                
#> [33] "1 BIRT"                                                                                 
#> [34] "2 DATE 2 Oct 1822"                                                                      
#> [35] "2 PLAC Weston, Madison, Connecticut, United States of America"                          
#> [36] "2 SOUR @S1@"                                                                            
#> [37] "3 PAGE Sec. 2, p. 45"                                                                   
#> [38] "1 DEAT"                                                                                 
#> [39] "2 DATE 14 Apr 1905"                                                                     
#> [40] "2 PLAC Stamford, Fairfield, Connecticut, United States of America"                      
#> [41] "1 BURI"                                                                                 
#> [42] "2 PLAC Spring Hill Cemetery, Stamford, Fairfield, Connecticut, United States of America"
#> [43] "1 FAMS @F1@"                                                                            
#> [44] "1 FAMS @F2@"                                                                            
#> [45] "1 RESI "                                                                                
#> [46] "2 DATE from 1900 to 1905"                                                               
#> [47] "0 @I2@ INDI"                                                                            
#> [48] "1 NAME Mary Ann /Wilson/"                                                               
#> [49] "2 SURN Wilson"                                                                          
#> [50] "2 GIVN Mary Ann"                                                                        
#> [51] "1 SEX F"                                                                                
#> [52] "1 BIRT"                                                                                 
#> [53] "2 DATE BEF 1828"                                                                        
#> [54] "2 PLAC Connecticut, United States of America"                                           
#> [55] "1 FAMS @F1@"                                                                            
#> [56] "0 @I3@ INDI"                                                                            
#> [57] "1 NAME Joe /Williams/"                                                                  
#> [58] "2 SURN Williams"                                                                        
#> [59] "2 GIVN Joe"                                                                             
#> [60] "1 SEX M"                                                                                
#> [61] "1 BIRT"                                                                                 
#> [62] "2 DATE 11 Jun 1861"                                                                     
#> [63] "2 PLAC Idaho Falls, Bonneville, Idaho, United States of America"                        
#> [64] "1 FAMC @F1@"                                                                            
#> [65] "1 FAMC @F2@"                                                                            
#> [66] "2 PEDI adopted"                                                                         
#> [67] "1 ADOP "                                                                                
#> [68] "2 DATE 16 Mar 1864"                                                                     
#> [69] "0 @F1@ FAM"                                                                             
#> [70] "1 HUSB @I1@"                                                                            
#> [71] "1 WIFE @I2@"                                                                            
#> [72] "1 CHIL @I3@"                                                                            
#> [73] "1 MARR"                                                                                 
#> [74] "2 DATE Dec 1859"                                                                        
#> [75] "2 PLAC Rapid City, Pennington, South Dakota, United States of America"                  
#> [76] "0 @F2@ FAM"                                                                             
#> [77] "1 HUSB @I1@"                                                                            
#> [78] "1 CHIL @I3@"                                                                            
#> [79] "0 @S1@ SOUR"                                                                            
#> [80] "1 DATA"                                                                                 
#> [81] "2 EVEN BIRT, DEAT, MARR"                                                                
#> [82] "3 DATE FROM Jan 1820 TO DEC 1825"                                                       
#> [83] "3 PLAC Madison, Connecticut, United States of America"                                  
#> [84] "2 AGNC Madison County Court"                                                            
#> [85] "1 TITL Madison County Birth, Death, and Marriage Records"                               
#> [86] "1 ABBR Madison BMD Records"                                                             
#> [87] "1 REPO @R1@"                                                                            
#> [88] "2 CALN 13B-1234.01"                                                                     
#> [89] "0 @R1@ REPO"                                                                            
#> [90] "1 NAME Family History Library"                                                          
#> [91] "1 ADDR"                                                                                 
#> [92] "2 ADR1 35 N West Temple Street"                                                         
#> [93] "2 CITY Salt Lake City"                                                                  
#> [94] "2 STAE Utah"                                                                            
#> [95] "2 POST 84150"                                                                           
#> [96] "2 CTRY United States of America"                                                        
#> [97] "0 TRLR"

Lines in a GEDCOM file can have a number of components:

  • Level: The level in the hierarchical structure. This appears for every line;
  • Cross-reference identifier: A string (which looks like @XYZ@) that signals the beginning of a new record (apart from header and trailer);
  • Tag: A short string given immediately after the level or cross-reference identifier that indicates the type of information being provided on the line. These are controlled values. User-defined tags have been allowed in other GEDCOM programs, but they are discouraged here;
  • Cross-reference pointer: This links to another record in the file (which looks like @XYZ@). In the above example, the Family Group record beginning on line 69 points to the Individual records of the family's members;
  • Line value: The value associated with the tag. For example, on line 6, the CHARacter encoding for the file is given as UTF-8.

No line has all of these components. For example, the first line of a record does not contain a line value. Also, a line will never have both a cross-reference pointer and a line value. For this reason, the tidyged object treats cross-reference pointers as just another line value.
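To make the components above concrete, here is a minimal base R sketch of how raw GEDCOM lines could be split into columns. The function `parse_gedcom_lines()` is a hypothetical helper written for illustration only; it is not part of tidyged or tidyged.io, and it assumes well-formed input that begins with a level 0 line.

```r
# Hypothetical helper for illustration; not the tidyged.io implementation.
parse_gedcom_lines <- function(lines) {
  # Level: the leading integer that appears on every line
  level <- as.integer(sub("^([0-9]+) .*$", "\\1", lines))
  rest <- sub("^[0-9]+ ", "", lines)

  # Cross-reference identifier: an @XYZ@ token directly after the level
  has_xref <- grepl("^@[^@ ]+@ ", rest)
  xref <- ifelse(has_xref, sub(" .*$", "", rest), NA_character_)
  rest <- ifelse(has_xref, sub("^@[^@ ]+@ ", "", rest), rest)

  # Tag: the next token; everything after it is the line value.
  # A cross-reference pointer is kept as just another value.
  tag <- sub(" .*$", "", rest)
  value <- ifelse(grepl(" ", rest), sub("^[^ ]+ ", "", rest), "")

  # Record: the identifier of the most recent level 0 line,
  # using "HD" and "TR" for the header and trailer
  starts <- level == 0
  start_id <- ifelse(!is.na(xref), xref,
                     ifelse(tag == "HEAD", "HD",
                            ifelse(tag == "TRLR", "TR", NA_character_)))
  record <- start_id[starts][cumsum(starts)]

  data.frame(level, record, tag, value, stringsAsFactors = FALSE)
}

lines <- c("0 HEAD", "1 CHAR UTF-8", "0 @I1@ INDI",
           "1 NAME Robert Eugene /Williams/", "1 FAMS @F1@", "0 TRLR")
ged <- parse_gedcom_lines(lines)
print(ged)
```

Each input line becomes one row, and every row carries the identifier of the record it belongs to, so the nesting of the file is preserved by row order and level alone.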

The tidyged representation of the above GEDCOM file is given below, using a sample dataset built into the package:

knitr::kable(tidyged::sample555)
level record tag value
0 HD HEAD
1 HD GEDC
2 HD VERS 5.5.5
2 HD FORM LINEAGE-LINKED
3 HD VERS 5.5.5
1 HD CHAR UTF-8
1 HD SOUR GS
2 HD NAME GEDCOM Specification
2 HD VERS 5.5.5
2 HD CORP gedcom.org
3 HD ADDR
4 HD CITY LEIDEN
3 HD WWW www.gedcom.org
1 HD DATE 2 OCT 2019
2 HD TIME 0:00:00
1 HD FILE 555Sample.ged
1 HD LANG English
1 HD SUBM @U1@
0 @U1@ SUBM
1 @U1@ NAME Reldon Poulson
1 @U1@ ADDR
2 @U1@ ADR1 1900 43rd Street West
2 @U1@ CITY Billings
2 @U1@ STAE Montana
2 @U1@ POST 68051
2 @U1@ CTRY United States of America
1 @U1@ PHON +1 (406) 555-1232
0 @I1@ INDI
1 @I1@ NAME Robert Eugene /Williams/
2 @I1@ SURN Williams
2 @I1@ GIVN Robert Eugene
1 @I1@ SEX M
1 @I1@ BIRT
2 @I1@ DATE 2 OCT 1822
2 @I1@ PLAC Weston, Madison, Connecticut, United States of America
2 @I1@ SOUR @S1@
3 @I1@ PAGE Sec. 2, p. 45
1 @I1@ DEAT
2 @I1@ DATE 14 APR 1905
2 @I1@ PLAC Stamford, Fairfield, Connecticut, United States of America
1 @I1@ BURI
2 @I1@ PLAC Spring Hill Cemetery, Stamford, Fairfield, Connecticut, United States of America
1 @I1@ FAMS @F1@
1 @I1@ FAMS @F2@
1 @I1@ RESI
2 @I1@ DATE FROM 1900 TO 1905
0 @I2@ INDI
1 @I2@ NAME Mary Ann /Wilson/
2 @I2@ SURN Wilson
2 @I2@ GIVN Mary Ann
1 @I2@ SEX F
1 @I2@ BIRT
2 @I2@ DATE BEF 1828
2 @I2@ PLAC Connecticut, United States of America
1 @I2@ FAMS @F1@
0 @I3@ INDI
1 @I3@ NAME Joe /Williams/
2 @I3@ SURN Williams
2 @I3@ GIVN Joe
1 @I3@ SEX M
1 @I3@ BIRT
2 @I3@ DATE 11 JUN 1861
2 @I3@ PLAC Idaho Falls, Bonneville, Idaho, United States of America
1 @I3@ FAMC @F1@
1 @I3@ FAMC @F2@
2 @I3@ PEDI adopted
1 @I3@ ADOP
2 @I3@ DATE 16 MAR 1864
0 @F1@ FAM
1 @F1@ HUSB @I1@
1 @F1@ WIFE @I2@
1 @F1@ CHIL @I3@
1 @F1@ MARR
2 @F1@ DATE DEC 1859
2 @F1@ PLAC Rapid City, Pennington, South Dakota, United States of America
0 @F2@ FAM
1 @F2@ HUSB @I1@
1 @F2@ CHIL @I3@
0 @S1@ SOUR
1 @S1@ DATA
2 @S1@ EVEN BIRT, DEAT, MARR
3 @S1@ DATE FROM JAN 1820 TO DEC 1825
3 @S1@ PLAC Madison, Connecticut, United States of America
2 @S1@ AGNC Madison County Court
1 @S1@ TITL Madison County Birth, Death, and Marriage Records
1 @S1@ ABBR Madison BMD Records
1 @S1@ REPO @R1@
2 @S1@ CALN 13B-1234.01
0 @R1@ REPO
1 @R1@ NAME Family History Library
1 @R1@ ADDR
2 @R1@ ADR1 35 N West Temple Street
2 @R1@ CITY Salt Lake City
2 @R1@ STAE Utah
2 @R1@ POST 84150
2 @R1@ CTRY United States of America
0 TR TRLR

The cross-reference identifier is given in the record column. Even though these aren’t given for the header and trailer records, they are assigned as “HD” and “TR” respectively in the tidyged object.
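Because every row carries its record identifier, simple questions can be answered with ordinary data frame filtering. The sketch below uses a small hand-built data frame with the same four columns as the table above (copied from the sample, not loaded from the package) and base R only; the variable names are illustrative. Finding birth dates shows the extra step needed for nested data: a DATE row has to be matched to the level 1 row it sits under.

```r
# A hand-built excerpt with the same columns as the tidyged object
ged <- data.frame(
  level  = c(0L, 1L, 1L, 2L, 1L, 2L, 0L, 1L, 1L, 2L),
  record = c(rep("@I1@", 6), rep("@I3@", 4)),
  tag    = c("INDI", "NAME", "BIRT", "DATE", "DEAT", "DATE",
             "INDI", "NAME", "BIRT", "DATE"),
  value  = c("", "Robert Eugene /Williams/", "", "2 OCT 1822", "", "14 APR 1905",
             "", "Joe /Williams/", "", "11 JUN 1861"),
  stringsAsFactors = FALSE
)

# Names: level 1 NAME rows, one per individual in this excerpt
indi_names <- ged[ged$level == 1 & ged$tag == "NAME", c("record", "value")]

# For every row, find the tag of the nearest preceding level 1 row.
# This relies on the rows being in file order.
is_l1 <- ged$level == 1
l1_ancestor <- c(NA, ged$tag[is_l1])[cumsum(is_l1) + 1]

# Birth dates: level 2 DATE rows that sit under a BIRT row
births <- ged[ged$level == 2 & ged$tag == "DATE" & l1_ancestor == "BIRT",
              c("record", "value")]

print(indi_names)
print(births)
```

The same filters could be written with dplyr verbs; base R is used here only to keep the sketch free of dependencies.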

GEDCOM files can be imported and exported using the tidyged.io package. Many GEDCOM processors modify the data on import, perhaps ignoring custom tags. This package reads the file as-is, and the only modifications it makes are those recommended in the GEDCOM 5.5.5 specification: replacing double ‘@’ symbols with single ones, merging CONC/CONT lines, and ensuring appropriate capitalisation of certain values. In general, the package sticks rigidly to the GEDCOM specification and does not use custom tags. There is a real issue with other desktop applications creating their own flavours of file and diverging from the specification; tidyged does not and will never do this.
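As a rough illustration of two of these import steps, the sketch below merges CONC/CONT rows into the line they continue and then replaces double ‘@’ symbols with single ones. The function `merge_continuations()` is a hypothetical helper, not the tidyged.io implementation. It assumes the usual GEDCOM convention that CONC joins text with no separator while CONT starts a new line, and that a continuation row never comes first.

```r
# Hypothetical helper for illustration; not the tidyged.io implementation.
merge_continuations <- function(ged) {
  is_cont <- ged$tag %in% c("CONC", "CONT")

  # Each continuation row belongs to the latest non-continuation row
  owner <- cumsum(!is_cont)

  # CONT text is preceded by a newline; CONC text is appended directly
  prefix <- ifelse(ged$tag == "CONT", "\n", "")
  pieces <- paste0(prefix, ged$value)
  merged <- vapply(split(pieces, owner), paste0, character(1), collapse = "")

  out <- ged[!is_cont, ]
  out$value <- unname(merged)

  # Escaped '@@' becomes a single '@'
  out$value <- gsub("@@", "@", out$value, fixed = TRUE)
  rownames(out) <- NULL
  out
}

# Made-up note record, used only to exercise the helper
ged <- data.frame(
  level  = c(0L, 1L, 1L, 1L),
  record = "@N1@",
  tag    = c("NOTE", "CONC", "CONT", "REFN"),
  value  = c("Contact me@@exam", "ple.com for", "more details", "42"),
  stringsAsFactors = FALSE
)
cleaned <- merge_continuations(ged)
print(cleaned)
```

Merging before unescaping matters in principle, since a long value could be split between the two ‘@’ characters of an escaped pair.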

If you want to modify single values in a GEDCOM file or do a simple Find/Replace, then this is straightforward to achieve with a simple text editor.
