One of the main characteristics I wanted for this package was to hide the complexity of the GEDCOM specification. I began with what I wanted the user interface to look like and worked backwards from there. I felt that adopting a tidyverse feel to the package would enhance readability and would significantly flatten the learning curve for those who are familiar with ggplot2.
I spent a significant amount of time before writing any code considering my options for how the data would be stored under the hood. In one blog post I considered storing genealogical data in a relational table format as it is easier to deal with, but discounted it very quickly as it is not well suited to nested data (list columns are not ideal for this application).
I toyed with the idea of using an off-the-shelf open source product like GRAMPS but I found it awkward to use and wanted something where I was in complete control, taking full advantage of the strengths of R.
I also considered using data structures more suited to this type of data, such as JSON or graphs (using the igraph or data.tree package). However, I discovered it would be quite difficult to represent some of the structures in the GEDCOM specification to my satisfaction.
It was some time before a better solution occurred to me: store the GEDCOM file almost as is, and just split out the components of each line into their own columns, creating a tidy version of a GEDCOM file (hence the name tidyged). This would allow easy manipulation using existing tidyverse infrastructure, although some additional processing would be needed to deal with the ordered and nested nature of the file. Ironically, I had returned to the idea of a tidy dataframe that I had dismissed early on.
The tidyged object is a tibble representation of a GEDCOM file. Before we see how this is structured, let’s see an example of a GEDCOM file:
readLines(system.file("extdata", "555SAMPLE.GED", package = "tidyged.io"))
#> [1] "0 HEAD"
#> [2] "1 GEDC"
#> [3] "2 VERS 5.5.5"
#> [4] "2 FORM LINEAGE-LINKED"
#> [5] "3 VERS 5.5.5"
#> [6] "1 CHAR UTF-8"
#> [7] "1 SOUR GS"
#> [8] "2 NAME GEDCOM Specification"
#> [9] "2 VERS 5.5.5"
#> [10] "2 CORP gedcom.org"
#> [11] "3 ADDR"
#> [12] "4 CITY LEIDEN"
#> [13] "3 WWW www.gedcom.org"
#> [14] "1 DATE 2 Oct 2019"
#> [15] "2 TIME 0:00:00"
#> [16] "1 FILE 555Sample.ged"
#> [17] "1 LANG English"
#> [18] "1 SUBM @U1@"
#> [19] "0 @U1@ SUBM"
#> [20] "1 NAME Reldon Poulson"
#> [21] "1 ADDR "
#> [22] "2 ADR1 1900 43rd Street West"
#> [23] "2 CITY Billings"
#> [24] "2 STAE Montana"
#> [25] "2 POST 68051"
#> [26] "2 CTRY United States of America"
#> [27] "1 PHON +1 (406) 555-1232"
#> [28] "0 @I1@ INDI"
#> [29] "1 NAME Robert Eugene /Williams/"
#> [30] "2 SURN Williams"
#> [31] "2 GIVN Robert Eugene"
#> [32] "1 SEX M"
#> [33] "1 BIRT"
#> [34] "2 DATE 2 Oct 1822"
#> [35] "2 PLAC Weston, Madison, Connecticut, United States of America"
#> [36] "2 SOUR @S1@"
#> [37] "3 PAGE Sec. 2, p. 45"
#> [38] "1 DEAT"
#> [39] "2 DATE 14 Apr 1905"
#> [40] "2 PLAC Stamford, Fairfield, Connecticut, United States of America"
#> [41] "1 BURI"
#> [42] "2 PLAC Spring Hill Cemetery, Stamford, Fairfield, Connecticut, United States of America"
#> [43] "1 FAMS @F1@"
#> [44] "1 FAMS @F2@"
#> [45] "1 RESI "
#> [46] "2 DATE from 1900 to 1905"
#> [47] "0 @I2@ INDI"
#> [48] "1 NAME Mary Ann /Wilson/"
#> [49] "2 SURN Wilson"
#> [50] "2 GIVN Mary Ann"
#> [51] "1 SEX F"
#> [52] "1 BIRT"
#> [53] "2 DATE BEF 1828"
#> [54] "2 PLAC Connecticut, United States of America"
#> [55] "1 FAMS @F1@"
#> [56] "0 @I3@ INDI"
#> [57] "1 NAME Joe /Williams/"
#> [58] "2 SURN Williams"
#> [59] "2 GIVN Joe"
#> [60] "1 SEX M"
#> [61] "1 BIRT"
#> [62] "2 DATE 11 Jun 1861"
#> [63] "2 PLAC Idaho Falls, Bonneville, Idaho, United States of America"
#> [64] "1 FAMC @F1@"
#> [65] "1 FAMC @F2@"
#> [66] "2 PEDI adopted"
#> [67] "1 ADOP "
#> [68] "2 DATE 16 Mar 1864"
#> [69] "0 @F1@ FAM"
#> [70] "1 HUSB @I1@"
#> [71] "1 WIFE @I2@"
#> [72] "1 CHIL @I3@"
#> [73] "1 MARR"
#> [74] "2 DATE Dec 1859"
#> [75] "2 PLAC Rapid City, Pennington, South Dakota, United States of America"
#> [76] "0 @F2@ FAM"
#> [77] "1 HUSB @I1@"
#> [78] "1 CHIL @I3@"
#> [79] "0 @S1@ SOUR"
#> [80] "1 DATA"
#> [81] "2 EVEN BIRT, DEAT, MARR"
#> [82] "3 DATE FROM Jan 1820 TO DEC 1825"
#> [83] "3 PLAC Madison, Connecticut, United States of America"
#> [84] "2 AGNC Madison County Court"
#> [85] "1 TITL Madison County Birth, Death, and Marriage Records"
#> [86] "1 ABBR Madison BMD Records"
#> [87] "1 REPO @R1@"
#> [88] "2 CALN 13B-1234.01"
#> [89] "0 @R1@ REPO"
#> [90] "1 NAME Family History Library"
#> [91] "1 ADDR"
#> [92] "2 ADR1 35 N West Temple Street"
#> [93] "2 CITY Salt Lake City"
#> [94] "2 STAE Utah"
#> [95] "2 POST 84150"
#> [96] "2 CTRY United States of America"
#> [97] "0 TRLR"
Lines in a GEDCOM file can have a number of components: a level, a cross-reference identifier, a tag, a line value, and a cross-reference pointer.
No line has all of these components. For example, the first line of a record does not contain a line value. Also, a line will never have both a cross-reference pointer and a line value defined. For this reason, the tidyged object treats cross-reference pointers as just another line value.
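To make the components concrete, here is a minimal base R sketch of splitting a single GEDCOM line into its parts. The helper `parse_gedcom_line()` and the column name `xref` are my own inventions for illustration; this is not how tidyged does it internally, and it assumes well-formed lines separated by single spaces.

```r
# Hypothetical helper for illustration; not a tidyged function.
parse_gedcom_line <- function(line) {
  parts <- strsplit(line, " ", fixed = TRUE)[[1]]
  level <- as.integer(parts[1])
  rest <- parts[-1]

  # A cross-reference identifier, if present, sits between level and tag.
  xref <- ""
  if (grepl("^@.+@$", rest[1])) {
    xref <- rest[1]
    rest <- rest[-1]
  }

  data.frame(
    level = level,
    xref  = xref,
    tag   = rest[1],
    # Everything after the tag is the value; a pointer such as @F1@
    # lands here too, like any other line value.
    value = paste(rest[-1], collapse = " "),
    stringsAsFactors = FALSE
  )
}

sample_lines <- c("0 HEAD", "0 @I1@ INDI", "2 DATE 2 Oct 1822", "1 FAMS @F1@")
parsed <- do.call(rbind, lapply(sample_lines, parse_gedcom_line))
print(parsed)
```

Note that the pointer `@F1@` on the `FAMS` line ends up in `value`, whereas the identifier `@I1@` that opens the individual record is kept separate.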
The tidyged representation of the above GEDCOM file is given below, using a sample dataset built into the package:
level | record | tag | value |
---|---|---|---|
0 | HD | HEAD | |
1 | HD | GEDC | |
2 | HD | VERS | 5.5.5 |
2 | HD | FORM | LINEAGE-LINKED |
3 | HD | VERS | 5.5.5 |
1 | HD | CHAR | UTF-8 |
1 | HD | SOUR | GS |
2 | HD | NAME | GEDCOM Specification |
2 | HD | VERS | 5.5.5 |
2 | HD | CORP | gedcom.org |
3 | HD | ADDR | |
4 | HD | CITY | LEIDEN |
3 | HD | WWW | www.gedcom.org |
1 | HD | DATE | 2 OCT 2019 |
2 | HD | TIME | 0:00:00 |
1 | HD | FILE | 555Sample.ged |
1 | HD | LANG | English |
1 | HD | SUBM | @U1@ |
0 | @U1@ | SUBM | |
1 | @U1@ | NAME | Reldon Poulson |
1 | @U1@ | ADDR | |
2 | @U1@ | ADR1 | 1900 43rd Street West |
2 | @U1@ | CITY | Billings |
2 | @U1@ | STAE | Montana |
2 | @U1@ | POST | 68051 |
2 | @U1@ | CTRY | United States of America |
1 | @U1@ | PHON | +1 (406) 555-1232 |
0 | @I1@ | INDI | |
1 | @I1@ | NAME | Robert Eugene /Williams/ |
2 | @I1@ | SURN | Williams |
2 | @I1@ | GIVN | Robert Eugene |
1 | @I1@ | SEX | M |
1 | @I1@ | BIRT | |
2 | @I1@ | DATE | 2 OCT 1822 |
2 | @I1@ | PLAC | Weston, Madison, Connecticut, United States of America |
2 | @I1@ | SOUR | @S1@ |
3 | @I1@ | PAGE | Sec. 2, p. 45 |
1 | @I1@ | DEAT | |
2 | @I1@ | DATE | 14 APR 1905 |
2 | @I1@ | PLAC | Stamford, Fairfield, Connecticut, United States of America |
1 | @I1@ | BURI | |
2 | @I1@ | PLAC | Spring Hill Cemetery, Stamford, Fairfield, Connecticut, United States of America |
1 | @I1@ | FAMS | @F1@ |
1 | @I1@ | FAMS | @F2@ |
1 | @I1@ | RESI | |
2 | @I1@ | DATE | FROM 1900 TO 1905 |
0 | @I2@ | INDI | |
1 | @I2@ | NAME | Mary Ann /Wilson/ |
2 | @I2@ | SURN | Wilson |
2 | @I2@ | GIVN | Mary Ann |
1 | @I2@ | SEX | F |
1 | @I2@ | BIRT | |
2 | @I2@ | DATE | BEF 1828 |
2 | @I2@ | PLAC | Connecticut, United States of America |
1 | @I2@ | FAMS | @F1@ |
0 | @I3@ | INDI | |
1 | @I3@ | NAME | Joe /Williams/ |
2 | @I3@ | SURN | Williams |
2 | @I3@ | GIVN | Joe |
1 | @I3@ | SEX | M |
1 | @I3@ | BIRT | |
2 | @I3@ | DATE | 11 JUN 1861 |
2 | @I3@ | PLAC | Idaho Falls, Bonneville, Idaho, United States of America |
1 | @I3@ | FAMC | @F1@ |
1 | @I3@ | FAMC | @F2@ |
2 | @I3@ | PEDI | adopted |
1 | @I3@ | ADOP | |
2 | @I3@ | DATE | 16 MAR 1864 |
0 | @F1@ | FAM | |
1 | @F1@ | HUSB | @I1@ |
1 | @F1@ | WIFE | @I2@ |
1 | @F1@ | CHIL | @I3@ |
1 | @F1@ | MARR | |
2 | @F1@ | DATE | DEC 1859 |
2 | @F1@ | PLAC | Rapid City, Pennington, South Dakota, United States of America |
0 | @F2@ | FAM | |
1 | @F2@ | HUSB | @I1@ |
1 | @F2@ | CHIL | @I3@ |
0 | @S1@ | SOUR | |
1 | @S1@ | DATA | |
2 | @S1@ | EVEN | BIRT, DEAT, MARR |
3 | @S1@ | DATE | FROM JAN 1820 TO DEC 1825 |
3 | @S1@ | PLAC | Madison, Connecticut, United States of America |
2 | @S1@ | AGNC | Madison County Court |
1 | @S1@ | TITL | Madison County Birth, Death, and Marriage Records |
1 | @S1@ | ABBR | Madison BMD Records |
1 | @S1@ | REPO | @R1@ |
2 | @S1@ | CALN | 13B-1234.01 |
0 | @R1@ | REPO | |
1 | @R1@ | NAME | Family History Library |
1 | @R1@ | ADDR | |
2 | @R1@ | ADR1 | 35 N West Temple Street |
2 | @R1@ | CITY | Salt Lake City |
2 | @R1@ | STAE | Utah |
2 | @R1@ | POST | 84150 |
2 | @R1@ | CTRY | United States of America |
0 | TR | TRLR | |
The cross-reference identifier is given in the record column. Even though these aren’t given for the header and trailer records, they are assigned as “HD” and “TR” respectively in the tidyged object.
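The way the record column is populated can be sketched in a few lines of base R: every level 0 line starts a new record, and its identifier is carried down to the lines beneath it. The function `lines_to_rows()` is a hypothetical name of mine, the real import code in tidyged.io may work quite differently, and the sketch assumes the first line is a level 0 line.

```r
# Hypothetical helper for illustration; not the tidyged.io implementation.
lines_to_rows <- function(lines) {
  level <- as.integer(sub(" .*$", "", lines))
  rest  <- sub("^[0-9]+ ", "", lines)

  # Strip a leading @XREF@ identifier where there is one.
  has_xref <- grepl("^@[^@ ]+@ ", rest)
  xref <- ifelse(has_xref, sub(" .*$", "", rest), NA_character_)
  rest <- ifelse(has_xref, sub("^@[^@ ]+@ ", "", rest), rest)

  tag   <- sub(" .*$", "", rest)
  value <- ifelse(grepl(" ", rest), sub("^[^ ]+ ", "", rest), "")

  # Each level-0 line starts a record; header and trailer get fixed labels.
  starts <- level == 0
  id <- ifelse(has_xref[starts], xref[starts],
               ifelse(tag[starts] == "HEAD", "HD", "TR"))
  # Carry each record's identifier down to the lines that belong to it.
  record <- id[cumsum(starts)]

  data.frame(level, record, tag, value, stringsAsFactors = FALSE)
}

ged <- lines_to_rows(c(
  "0 HEAD", "1 CHAR UTF-8",
  "0 @I3@ INDI", "1 NAME Joe /Williams/", "1 FAMC @F1@",
  "0 TRLR"
))
print(ged)

# With the record column in place, ordinary subsetting answers simple
# questions, such as the names of all individuals in the file.
indi <- ged$record[ged$level == 0 & ged$tag == "INDI"]
indi_names <- ged$value[ged$record %in% indi & ged$tag == "NAME"]
```

Because every row carries its record identifier, the nesting within a record is recoverable from the row order and the level column alone.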
GEDCOM files can be imported and exported using the tidyged.io package. Many GEDCOM processors modify the data on import, perhaps ignoring custom tags. This package reads the file as-is; the only modifications are those recommended in the GEDCOM 5.5.5 specification, specifically: replacing double ‘@’ symbols with single ones, merging CONC/CONT lines, and ensuring appropriate capitalisation of certain values. In general, the package sticks rigidly to the GEDCOM specification and does not use custom tags. There is a real issue with other desktop applications creating their own flavours of file and diverging from the specification; tidyged does not and will never do this.
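As a rough illustration of two of these import steps, the base R sketch below merges CONC/CONT lines into the line they continue and then collapses double ‘@’ symbols. The function `clean_on_import()` and the sample data are hypothetical, and tidyged.io's own code may differ. I am assuming the usual GEDCOM meaning of the two tags (CONC joins text directly, CONT starts a new line) and that the first row is never a continuation line.

```r
# Hypothetical helper for illustration; tidyged.io's own code may differ.
# Assumes the first row is not a CONC/CONT line.
clean_on_import <- function(ged) {
  is_cont <- ged$tag %in% c("CONC", "CONT")
  # Index of the line each row belongs to (itself, or the line it continues).
  owner <- cumsum(!is_cont)
  # CONT text begins on a new line; CONC text is appended directly.
  piece <- paste0(ifelse(ged$tag == "CONT", "\n", ""), ged$value)
  merged <- vapply(split(piece, owner), paste, character(1), collapse = "")

  out <- ged[!is_cont, ]
  # Collapse escaped '@@' into a single '@' once the text is whole again.
  out$value <- gsub("@@", "@", unname(merged), fixed = TRUE)
  rownames(out) <- NULL
  out
}

ged <- data.frame(
  level = c(1L, 2L, 2L, 1L),
  tag   = c("NOTE", "CONC", "CONT", "NAME"),
  value = c("Email joe@@exam", "ple.com", "Second line", "Joe /Williams/"),
  stringsAsFactors = FALSE
)
cleaned <- clean_on_import(ged)
print(cleaned)
```

Merging before unescaping means an escaped ‘@@’ is handled the same way wherever it falls in the joined text. Cross-reference pointers such as `@F1@` contain no doubled symbol, so they pass through untouched.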
If you want to modify single values in a GEDCOM file or do a simple Find/Replace, then this is straightforward to achieve with a simple text editor.