Why gedcomS7?
One of the main characteristics I wanted for this package was to hide the complexity of the GEDCOM specification, and try to automate genealogical tasks that are time consuming to do manually.
I spent a significant amount of time before writing any code considering my options for how the data would be stored under the hood. In one blog I considered storing genealogical data in a relational table format as it is easier to deal with, but discounted it very quickly as it is not well suited to nested data (and list columns are not easy to deal with).
I toyed with the idea of using an off-the-shelf open source product like GRAMPS but I found it awkward to use and wanted something where I was in complete control, taking full advantage of the strengths of R.
I also considered using data structures more suited to this type of
data, such as JSON or graphs (using the igraph
or
data.tree
package). However, I discovered it would be quite
difficult representing some of the structures in the GEDCOM
specification to my satisfaction.
My preference was for an Object Orientated approach, which is what many other applications use, but none of the OOP solutions in R quite fit the bill.
My first serious attempt at doing this resulted in
tidyged
and the other packages of the
gedcompendium
. This adopted a dataframe and
tidyverse
approach and whilst it did work it was
dependency-heavy and required a lot of processing under the hood.
The release of S7
presented an ideal opportunity to
revisit creating an OOP-based GEDCOM package in R. Initial testing had
promising results and so I continued building out the rest of the
package, using the updated GEDCOM 7.0 specification.
The gedcomS7 object
The main GEDCOM object is an S7 representation of a GEDCOM file:
library(gedcomS7)
ged <- read_gedcom("https://gedcom.io/testfiles/gedcom70/remarriage1.ged")
str(ged)
#> <gedcomS7::GedcomS7>
#> @ header : <gedcomS7::GedcomHeader>
#> .. @ gedcom_version : chr "7.0"
#> .. @ ext_tags : chr(0)
#> .. @ source : NULL
#> .. @ destination : chr(0)
#> .. @ creation_date : <gedcomS7::DateExact>
#> .. .. @ year : int 2025
#> .. .. @ month : int 1
#> .. .. @ day : int 26
#> .. .. @ GEDCOM_STRING: chr "26 JAN 2025"
#> .. .. @ as_date : Date[1:1], format: "2025-01-26"
#> .. @ creation_time : chr(0)
#> .. @ subm_xref : chr(0)
#> .. @ gedcom_copyright : chr(0)
#> .. @ default_language : chr(0)
#> .. @ default_place_form: chr(0)
#> .. @ notes : list()
#> .. @ note_xrefs : chr(0)
#> .. @ GEDCOM : chr [1:4] "0 HEAD" "1 GEDC" "2 VERS 7.0" "1 DATE 26 JAN 2025"
#> @ records : <gedcomS7::GedcomRecords>
#> .. @ prefixes : Named chr [1:7] "U" "I" "F" "S" "R" "M" "N"
#> .. .. - attr(*, "names")= chr [1:7] "SUBM" "INDI" "FAM" "SOUR" ...
#> .. @ XREFS :List of 7
#> .. .. $ SUBM : chr(0)
#> .. .. $ INDI : chr [1:3] "@I1@" "@I2@" "@I3@"
#> .. .. $ FAM : chr [1:2] "@F1@" "@F2@"
#> .. .. $ SOUR : chr(0)
#> .. .. $ REPO : chr(0)
#> .. .. $ OBJE : chr(0)
#> .. .. $ SNOTE: chr(0)
#> .. @ XREFS_PRIV :List of 7
#> .. .. $ SUBM : chr(0)
#> .. .. $ INDI : chr(0)
#> .. .. $ FAM : chr(0)
#> .. .. $ SOUR : chr(0)
#> .. .. $ REPO : chr(0)
#> .. .. $ OBJE : chr(0)
#> .. .. $ SNOTE: chr(0)
#> .. @ XREFS_CONFID:List of 7
#> .. .. $ SUBM : chr(0)
#> .. .. $ INDI : chr(0)
#> .. .. $ FAM : chr(0)
#> .. .. $ SOUR : chr(0)
#> .. .. $ REPO : chr(0)
#> .. .. $ OBJE : chr(0)
#> .. .. $ SNOTE: chr(0)
#> .. @ XREFS_NEXT : Named chr [1:7] "@U1@" "@I4@" "@F3@" "@S1@" "@R1@" "@M1@" "@N1@"
#> .. .. - attr(*, "names")= chr [1:7] "SUBM" "INDI" "FAM" "SOUR" ...
#> .. @ RAW : <gedcomS7::GedcomRecordsRaw>
#> .. .. @ SUBM : Named list()
#> .. .. @ INDI :List of 3
#> .. .. .. $ @I1@: chr [1:5] "0 @I1@ INDI" "1 NAME John Q /Public/" "1 SEX M" "1 FAMS @F1@" ...
#> .. .. .. $ @I2@: chr [1:4] "0 @I2@ INDI" "1 NAME Jane /Doe/" "1 SEX F" "1 FAMS @F1@"
#> .. .. .. $ @I3@: chr [1:5] "0 @I3@ INDI" "1 NAME Mary /Roe/" "1 DEAT" "2 DATE 1 MAR 1914" ...
#> .. .. @ FAM :List of 2
#> .. .. .. $ @F1@: chr [1:9] "0 @F1@ FAM" "1 HUSB @I1@" "1 WIFE @I2@" "1 MARR" ...
#> .. .. .. $ @F2@: chr [1:5] "0 @F2@ FAM" "1 HUSB @I1@" "1 WIFE @I3@" "1 MARR" ...
#> .. .. @ SOUR : Named list()
#> .. .. @ REPO : Named list()
#> .. .. @ OBJE : Named list()
#> .. .. @ SNOTE: Named list()
#> @ update_change_dates: logi FALSE
#> @ add_creation_dates : logi FALSE
#> @ GEDCOM : chr [1:33] "0 HEAD" "1 GEDC" "2 VERS 7.0" "1 DATE 26 JAN 2025" ...
Properties of the GEDCOM object can be accessed and modified using
the @
operator, e.g.
ged@header@default_language
#> character(0)
ged@header@default_language <- "en"
This ease of modification of specific properties of a GEDCOM object
wasn’t possible with the tidyged
package.
Properties which don’t have values are either empty vectors, empty
lists, or NULL (depending on the property). Many properties take values
which are particular gedcomS7
objects (or lists of them, if
they take more than one). For ease of use, you are often permitted to
provide a simple atomic vector, and gedcomS7
will convert
these to their relevant objects automatically, but only if the object
only requires its first property. For example, the @notes
property can take a simple character vector of notes, and these will be
converted to a list of Note()
objects.
ged@header@notes <- c("This is a note", "This is another note")
str(ged@header@notes)
#> List of 2
#> $ : <gedcomS7::Note>
#> ..@ text : chr "This is a note"
#> ..@ language : chr(0)
#> ..@ media_type : chr(0)
#> ..@ translations: list()
#> ..@ citations : list()
#> ..@ GEDCOM : chr "0 NOTE This is a note"
#> $ : <gedcomS7::Note>
#> ..@ text : chr "This is another note"
#> ..@ language : chr(0)
#> ..@ media_type : chr(0)
#> ..@ translations: list()
#> ..@ citations : list()
#> ..@ GEDCOM : chr "0 NOTE This is another note"
You can then access properties of these:
ged@header@notes[[2]]@language <- "en"
Some of these properties are also read-only (calculated from other
properties), such as @GEDCOM
and these are indicated by
being in all capitals. The exploration of all properties of the GEDCOM
object is beyond the scope of this article, however the implementation
of how properties are stored is important.
The Push/Pull paradigm
A GEDCOM file could contain many thousands of records containing
information on individuals, families, notes, sources, etc. Whilst
storing each of these records as S7
objects within the main
GEDCOM object is theoretically possible, in practice it very quickly
eats up too much memory rendering the idea a non-starter.
For this reason, records are stored in their raw form from the GEDCOM
file as lists of character vectors in the @RAW
property of
the @records
property. For example, the lines in the GEDCOM
file for the first individual are:
ged@records@RAW@INDI[[1]]
#> [1] "0 @I1@ INDI" "1 NAME John Q /Public/" "1 SEX M"
#> [4] "1 FAMS @F1@" "1 FAMS @F2@"
You can also reference records by their xref:
ged@records@RAW@INDI[["@I1@"]]
#> [1] "0 @I1@ INDI" "1 NAME John Q /Public/" "1 SEX M"
#> [4] "1 FAMS @F1@" "1 FAMS @F2@"
If you want to edit a record, you must first Pull it from the GEDCOM
object. This takes a copy of the record and parses it into an editable
S7
object.
john_public <- pull_record(ged, "@I1@")
str(john_public, max.level = 1)
#> <gedcomS7::IndividualRecord>
#> @ xref : chr "@I1@"
#> @ confidential : logi FALSE
#> @ locked : logi FALSE
#> @ private : logi FALSE
#> @ user_ids : chr(0)
#> @ unique_ids : chr(0)
#> @ ext_ids : chr(0)
#> @ note_xrefs : chr(0)
#> @ notes : list()
#> @ citations : list()
#> @ media_links : list()
#> @ created : NULL
#> @ updated : NULL
#> @ RESTRICTIONS : chr(0)
#> @ GEDCOM_IDENTIFIERS: chr(0)
#> @ pers_names :List of 1
#> @ sex : chr "M"
#> @ facts : list()
#> @ non_events : list()
#> @ ordinances : list()
#> @ fam_links_chil : list()
#> @ fam_links_spou :List of 2
#> @ subm_xrefs : chr(0)
#> @ associations : list()
#> @ alia_xrefs : chr(0)
#> @ anci_xrefs : chr(0)
#> @ desi_xrefs : chr(0)
#> @ PRIMARY_NAME : chr "John Q Public"
#> @ ALL_NAMES : chr "John Q Public"
#> @ BIRTH_DATE : chr(0)
#> @ BIRTH_PLACE : chr(0)
#> @ IS_ALIVE : logi TRUE
#> @ DEATH_DATE : chr(0)
#> @ DEATH_PLACE : chr(0)
#> @ GEDCOM : chr [1:5] "0 @I1@ INDI" "1 NAME John Q /Public/" "1 SEX M" ...
We can edit a property and Push it back to the GEDCOM object:
john_public@notes <- "John once had a dog called Rover"
ged <- push_record(ged, john_public)
ged@records@RAW@INDI[[1]]
#> [1] "0 @I1@ INDI"
#> [2] "1 NAME John Q /Public/"
#> [3] "1 SEX M"
#> [4] "1 FAMS @F1@"
#> [5] "1 FAMS @F2@"
#> [6] "1 NOTE John once had a dog called Rover"
You should never attempt to modify the records in their character vector form directly from the GEDCOM object - additional checks and automated tasks are carried out to ensure self-consistency during the Push process.
Viewing gedcomS7
objects
There are two ways to view the contents of a gedcomS7
object - the most comprehensive is to use str()
(as above)
which will show you every property the object has. The alternative is to
use print()
or summary()
, both of which will
provide a brief summary of the object (not every property may be
displayed). For example:
ged
#> GEDCOM file summary:
#>
#> GEDCOM version: 7.0
#> Creation Date: 26 JAN 2025
#> Default Language: en
#> Source system: <Undefined>
#>
#> Copyright: <Undefined>
#>
#> Submitters: 0
#> Individuals: 3
#> Families: 2
#> Sources: 0
#> Repositories: 0
#> Multimedia: 0
#> Notes: 0
ged@header@notes
#> [[1]]
#> Note: This is a note
#>
#> Language: <Undefined>
#> Format: <Undefined>
#> Translations: 0
#> Citations: 0
#>
#> [[2]]
#> Note: This is another note
#>
#> Language: en
#> Format: <Undefined>
#> Translations: 0
#> Citations: 0
john_public@pers_names
#> [[1]]
#> Personal Name: John Q /Public/
#> Name Type: <Undefined>
#>
#> Translations: 0
#> Citations: 0
#> Notes: 0