The GEDCOM Standard Release 5.5
Chapter 1
Data Representation Grammar
Introduction
This chapter describes the core GEDCOM data representation
language.
The generic data representation language defined in this
chapter may be used to represent any form of structured
information, not just genealogical data, using a sequential
stream of characters.
Concepts
A GEDCOM transmission represents a database in the form of a
sequential stream of related records. A record is represented as
a sequence of tagged, variable-length lines, arranged in a
hierarchy. A line always contains a hierarchical level number, a
tag, and an optional value. A line may also contain a
cross-reference identifier or a pointer. The GEDCOM line is
terminated by a carriage return, a line feed character, or any
combination of these.
The tag in the GEDCOM line, taken in its hierarchial context,
identifies the information contained in the line, in the same
sense that a field-name identifies a field in a database record.
This means that the data is self-defining. Tags allow a field to
occur any number of times within a record, including zero times.
They also allow the use of different or new fields to be included
in the GEDCOM data without introducing incompatibility, because
the receiving system will ignore data which it does not
understand and process only the data that it does understand.
The hierarchical relationships are indicated by a level
number. Subordinate lines have a higher level number. The
hierarchy allows a line to have sub-lines, which in turn may have
their own sub-lines, and so forth. A line and its sub-lines
constitute a context or enclosure, that is, a cluster of
information pertaining directly to the same thing. This
hierarchical arrangement corresponds with the natural hierarchy
found in most structured information.
A series of one or more lines constitutes a record. The
beginning of a new record is indicated by a line whose level
number is 0 (zero).
In addition to hierarchical relationships, GEDCOM defines the
inter-record relationships that allow a record to be logically
related to other records, without introducing redundancy. These
relationships are represented by two additional, but optional,
parts of a line: a cross-reference pointer and a cross-reference
identifier. The cross-reference pointer "points at" a related
record, which is identified by a required, matching unique
cross-reference identifier. The cross-reference identifier is
analogous to a primary key in relational database
terminology.
Grammar
This chapter defines the grammar for the GEDCOM format. The
grammar is a set of rules that specify the character sequences
that are valid for creating the expression of the GEDCOM line.
The character sequences are described in terms of various
combinations of elements (variables and/or constants). Elements
may be described in terms of a set of other elements, some of
which are selected from a set of alternative elements. Each
element in the definition is separated by a plus sign (+)
signifying that both elements are required. When there is a
choice of different elements that can be used, the set of
alternatives are listed between opening and closing square
brackets ([]), with each choice separated by a vertical bar
([alternative_1 | alternative_2]). The user can read the grammar
components of the selected element by substituting any
sub-elements until all sub-elements have been resolved.
A GEDCOM transmission consists of a sequence of logical
records, each of which consists of a sequence of
gedcom_lines, all contained in a sequential file or stream
of characters. The following rules pertain to the
gedcom_line:
- Long values can be broken into shorter GEDCOM lines by using
a subordinate CONC or CONT tag. The CONC tag assumes that the
accompanying subordinate value is concatenated to the previous
line value without saving the carriage return prior to the line
terminator. The CONT assumes that the subordinate line value is
concatenated to the previous line, saving the carriage
return.
- The beginning of a new logical record is designated by a line
whose level number is 0 (zero).
- Each new level number must be no higher than the previous
line plus 1.
- Logical GEDCOM record sizes should be constrained so that
they will fit in a memory buffer of less than 32K. GEDCOM files
with records sizes greater than 32K run the risk of not being
able to be loaded in some programs. Use of pointers to records,
particularly NOTE records, should ensure that this limit will be
sufficient. The size of embedded multimedia records can be
controlled through chaining MULTIMEDIA_RECORDS.
- Any length constraints are given in characters, not bytes.
When wide characters (characters wider than 8 bits) are used,
byte buffer lengths should be adjusted accordingly.
- Level numbers must be between 1 to 99 and must not contain
leading zeroes, for example, level one must be 1, not 01.
- The cross-reference ID has a maximum of 22 characters,
including the enclosing at signs (@), and it must be unique
within the GEDCOM transmission.
- Pointers to records imply that the record pointed to does
actually exists within the transmission. Future pointer
structures may allow pointing to records within a public
accessible database as an alternative.
- The length of the GEDCOM TAG is a maximum of 31 characters,
with the first 15 characters being unique.
- The total length of a GEDCOM line, including leading white
space, level number, cross-reference number, tag, value,
delimiters, and terminator, must not exceed 255 (wide)
characters.
- Leading white space (tabs, spaces, and extra line
terminators) preceding a GEDCOM line should be ignored by the
reading system. Systems generating GEDCOM should not place any
white space in front of the GEDCOM line.
Grammar Syntax
A gedcom_line has the following syntax:
gedcom_line :=
level + delim + [xref_id + delim +] tag + [delim + line_value
+] terminator
level + delim + optional_xref_id + tag + delim +
optional_line_value + terminator
for example:
1 NAME Will /Rogers/
The components used in the pattern above are defined below in
alphabetical order. Some of the components are defined in terms
of other primitive patterns. The spaces used in the patterns
below are only to set them apart and are not a part of the
resulting pattern. Character constants are specified in the hex
form (0x20) which is the ASCII hex value of a space character.
Character constants that are separated by a (-) dash represent
any character with in that range from the first constant shown to
and including the second constant shown.
alpha :=
[(0x41)-(0x5A) | (0x61)-(0x7A) | (0x5F) ]
where:
(0x41)-(0x5A)=A to Z
(0x61)-(0x7A)=a to z
(0x5F)=(_) underscore
alphanum :=
[ alpha | digit ]
any_char :=
[alpha | digit | otherchar | (0x23) | (0x20) | (0x40)+(0x40) ]
where:
(0x23)=#
(0x20)=space character
(0x40)+(0x40)=@@
delim :=
[(0x20) ]
where:
(0x20)=space_character
digit :=
[(0x30)-(0x39) ]
where:
(0x30)-(0x39) = One of the digits 0, 1,2,3,4,5,6,7,8,9
escape :=
[(0x40) + (0x23) + escape_text + (0x40) + non_at ]
where:
(0x40)=@
(0x23)=#
escape_text :=
[any_char | escape_text + any_char ]
The escape_text is coded to meet the rules of a particular GEDCOM
form.
level :=
[digit | level + digit ]
(Do not use non-significant leading zeroes such as 02.)
line_item :=
[any_char | escape | line_item + any_char | line_item + escape]
line_value :=
[ pointer | line_item ]
non_at :=
[alpha | digit | otherchar | (0x23) | (0x20 ) ]
where:
(0x20)=space character
(0x23)=#
null := nothing
optional_line_value := line_value + delim
optional_xref_Id := xref_Id + delim
otherchar :=
[(0x21)-(0x22) | (0x24)-(0x2F) | (0x3A)-(0x3F) | (0x5B)-(0x5E) |
(0x60) | (0x7B)-(0x7E) | (0x80)-(0xFE)]
where, respectively:
(0x21)-(0x22)=! "
(0x24)-(0x2F)=$ & ' ( ) * + , - . /
(0x3A)-(0x3F)=: ; < = > ?
(0x5B)-(0x5E)=[ \ ] ^
(0x60)=`
(0x7B)-(0x7E)={ | } ~
(0x80)-(0xFE)=ANSEL characters above 127
Any 8-bit ASCII character except control characters (0x00%0x1F),
alphanum, space ( ), number sign (#), at sign (@), _ underscore,
and the DEL character (0x7F).
pointer :=
[(0x40) + alphanum + pointer_string + (0x40) ]
where:
(0x40)=@
pointer_char :=
[non_at ]
pointer_string :=
[null | pointer_char | pointer_string + pointer_char ]
tag :=
[alphanum | tag + alphanum ]
terminator :=
[carriage_return | line_feed | carriage_return + line_feed |
line_feed + carriage_return ]
xref_id :=
[pointer]
Description of Grammar Components
alpha :=
The alpha characters include the underscore, which is used to
link word pieces together in forming tag names or tag labels.
any_char :=
Any 8-bit ASCII character except the control characters found in
the range of 0x00%0x1F and 0x7F. If an @ is desired as part of
the line_value, it must be written in GEDCOM as a double @, i.e.,
"3 doz. @ $20.00" must be stored as "3 doz. @@ $20.00."
delim :=
The delim (delimiter), a single space character,
terminates both the variable-length level number and the
variable-length tag . Note that space characters may also
be present in a value .
escape :=
The escape is a character sequence in the grammar used to
specify special processing, such as for switching character sets
or for indicating an inclusion of a non-GEDCOM data form into the
GEDCOM structure. The form of the escape sequence is:
@+#+ escape_text +@+ non_at .
Receiving systems should discard any space character which
follows the escape sequences closing at-sign (@). If the
character following the escape sequence's closing at-sign (@) is
not aspace character then it should be kept as a part of the text
following the escape. Systems writing escape sequences should
always output a space character following the escape sequence.
The specific format of the escape sequence is defined for the
specific GEDCOM form being defined.
escape_text :=
The escape_text is defined to meet the requirements of a
particular GEDCOM form.
level :=
The level number works the same way as the level of
indentation in an indented outline, where indented lines provide
detail about the item under which they are indented. A line at
any level L is enclosed by and pertains directly to the
nearest preceding line at level L-1. The Level L may
increase by 1 at most. Level numbers must not contain leading
zeroes, for example level one must be (1), not (01).
The enclosed subordinate lines at level L are said to be in the
context of the enclosing superior line at level L-1. The
interpretation of a tag must be in the context of the
tag s of the enclosing line(s) rather than just the tag by
itself. Take the following record about an individual's birth and
death dates, for example:
0 INDI
1 BIRT
2 DATE 12 MAY 1920
1 DEAT
2 DATE 1960
In this example, the expression DATE 12 MAY 1920 is interpreted
within the INDI (individual) BIRT (birth) context, representing
the individual's birth date. The second DATE is in the INDI.DEAT
(individual's death) context. The complete meaning of DATE
depends on the context.
Note:The above example is indented according to the
level numbers to make the concept more obvious. In the actual
GEDCOM data, the level numbers are lined up vertically, meaning
they are the first character(s) of the GEDCOM line.
Some systems output indented GEDCOM data for better readability
by putting space or tab characters between the terminator and the
level number of the next line to visibly show the
hierarchy. Also, some people have suggested allowing extra blank
lines to visibly separate physical records. GEDCOM files produced
with these features are not to be used when transmitting GEDCOM
to other systems.
line_value :=
The line_value identifies an object within the domain of
possible values allowed in the context of the tag . The
combination of the tag , the line_value , and the
hierarchical context of the supporting gedcom_line s
provides the understanding of the enclosed value s. This
domain is defined by a specific grammar for representing a given
GEDCOM form. (See Chapter 2 for
Lineage-Linked GEDCOM Form grammar.)
Values whose source information contains illegible parts of the
value should be indicated by replacing the illegible part with an
ellipsis (...).
Values are generally not encoded in binary or other abbreviation
schemes for reducing space requirements, and they are generally
constrained to be understandable by a typical user without
decoding. This is intended to reduce the decoding burden on the
receiving software. A GEDCOM-optimized data compression standard
will be defined in the future to reduce space requirements.
Meanwhile, users may agree to compress and decompress GEDCOM
files using any compression system available to both sender and
receiver.
The line_value within the context of a tag hierarchy of
gedcom_line s represents one piece of information and
corresponds to one field in traditional database or file
terminology.
otherchar :=
Any 8-bit ASCII character except control characters (0x00%0x1F),
alphanum, space ( ), number-sign (#), the at sign (@), and
the DEL character (0x7F).
pointer :=
A pointer stands in the place of the context identified by
the matching xref_id . Theoretically, a receiving system
should be prepared to follow a pointer to find any
needed value in a manner that is transparent to the logic
of the subsystem that is looking for specific tag s. This
highly flexible facility will probably be used more in the
future. For the time being, however, the use of pointer s
is explicitly defined within the GEDCOM form, such as the
Lineage-Linked GEDCOM Form defined in Chapter 2).
The pointer represents the association between two objects
that usually reside in different records. Objects within a
logical record can be associated. If this need exists, the
pointer record composition contains an exclamation point (!) that
separates the parent record's cross-reference ID from the
specific substructure's cross-reference ID, which is at some
subordinate level to the logical record at level zero. The
cross-reference ID of the substructure subordinate to a zero
level record, for inter-record associations is always composed of
the Record ID number and the Substructure ID number, such as
@I132!1@. Including the Record ID number in the pointer that
associates objects within a record will allow the GEDCOM
processors to build the index only at the record level and then
search sequentially for the appropriate substructure
cross-reference ID. The parent record ID is assumed when the
cross-reference ID begins with a exclamation point (!) signifying
an intra-record association.
Complex logical record structures are divided into small physical
records to accommodate memory constraints, many-to-many
relationships, and independent record creation and deletion.
The pointer must match a corresponding unique
xref_id within the transmission, unless the colon (:)
character is present (which will be used in the future as a
network reference to a permanent file record). A pointer
is given instead of duplicating an object, though the logical
result is equivalent. An expanded traversal of a record tree
includes following the pointer to related records to some
depth, and splicing those records (logically) into the resultant
expanded tree. Pointers may refer to either records which
have not yet appeared in the transmission (forward reference) or
to records that have already appeared earlier in the
transmission(backward reference). This arrangement usually
requires a preliminary pass to construct a look up table to
support random access by xref_id during subsequent passes.
tag :=
A tag consists of a variable length sequence of
alphanum characters. All user-defined tags, tags used that
have not been defined in the GEDCOM standard, must begin with an
underscore character (0x95).
The tag represents the meaning of the line_value
within the context of the enclosing lines, and contributes to the
meaning of enclosed subordinate lines. Specific tag s are
defined in Appendix A. The presence
of a tag together with a value represents an assertion which the
submitter wishes to communicate to a receiver. A tag with no
value does not represent an assertion. If a tag is absent, no
assertion is made, for example, no information is submitted.
Information of a negative nature (such as knowing positively an
event did not occur) is handled through the semantic definition
of a tag and accompanying values that assert the information
explicitly. It is not represented by absence of a tag.
Although formally defined tag s are only three or four
characters long, systems should prepare to handle user tags of
greater length. Tags will be unique within the first 15
characters.
Valid combinations of specific tag s, line_value s,
xref_id s, and pointer s are constrained by the
GEDCOM form defined for representing a given kind of information.
(See Chapter 2 for the Lineage-Linked
GEDCOM Form grammar.)
terminator :=
The terminator delimits the variable-length
line_value and signals the end of the gedcom_line .
The valid terminator characters are:
[carriage_return |
line_feed |
carriage_return line_feed |
line_feed carriage_return ]
xref_id :=
(See pointer)
The xref_id is formed by any arbitrary combination of
characters from the pointer_char set. The first character
must be an alpha or a digit. The xref_id is not retained
in the receiving system, and it may therefore be formed from any
convenient combination of identifiers from the sending system. No
meaning is attributed by the receiver to any part of the
xref_id , other than its unique association with
the associated record. The use of the colon (:) character is also
reserved.
Examples:
The following are examples of valid but unrelated GEDCOM lines:
0 @1234@ INDI
. . .
1 AGE 13y
. . .
1 CHIL @1234@
. . .
1 NOTE This is a note field that is
2 CONT continued on the next line.
The first line has a level number 0, a xref_id of
@1234@, an INDI tag , and no value .
The second line has a level number 1, no xref_id ,
an AGE tag , and a value of 13.
The third line has a level number 1, no xref_id , a
CHIL tag , and a value of a pointer to a
xref_id named @1234@.