vcf

Classes for generating VCF and records for testing

This module contains utility classes for the generation of VCF files and variant records, for use in testing.

The module contains the following public classes:

  • VariantBuilder() -- A builder class that accumulates variant records, exposes them as a list, and writes them to a file.

Examples

Typically, pysam.VariantRecord records are obtained by reading a VCF file. The VariantBuilder() class builds such records directly.

Variants are added with the add() method, which returns a pysam.VariantRecord.

>>> import pysam
>>> from fgpyo.vcf.builder import VariantBuilder
>>> builder: VariantBuilder = VariantBuilder()
>>> new_record_1: pysam.VariantRecord = builder.add()  # uses the defaults
>>> new_record_2: pysam.VariantRecord = builder.add(
...     contig="chr2", pos=1001, id="rs1234", ref="C", alts=["T"],
...     qual=40, filter=["PASS"]
... )

VariantBuilder can create sites-only, single-sample, or multi-sample VCF files. To produce anything other than a sites-only VCF file, pass a list of sample IDs when constructing the VariantBuilder:

>>> builder: VariantBuilder = VariantBuilder(sample_ids=["sample1", "sample2"])
>>> new_record_1: pysam.VariantRecord = builder.add()  # uses the defaults
>>> new_record_2: pysam.VariantRecord = builder.add(
...     samples={"sample1": {"GT": "0|1"}, "sample2": {"GT": "0|0"}}
... )

The variants stored in the builder can be written to a coordinate-sorted VCF file via the to_path() method, which returns the path to that file:

>>> from pathlib import Path
>>> path_to_vcf: Path = builder.to_path()  
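
When no path is given, the builder source creates a temporary file with a ".vcf.gz" suffix and later checks the suffix to decide whether to build a tabix index. The following is a minimal standard-library sketch of that path handling only; no VCF is written, and the helper name `temp_vcf_path` is hypothetical rather than part of fgpyo.

```python
from pathlib import Path
from tempfile import NamedTemporaryFile


def temp_vcf_path() -> Path:
    """Creates an empty temporary file with a .vcf.gz suffix and returns its path."""
    # delete=False keeps the file on disk after the handle is closed
    with NamedTemporaryFile(suffix=".vcf.gz", delete=False) as fp:
        return Path(fp.name)


path = temp_vcf_path()
existed_after_creation = path.is_file()

# Path.suffix holds only the last extension, so ".vcf.gz" is reported as ".gz"
is_gzipped_name = path.suffix == ".gz"

# With delete=False the caller is responsible for cleanup
path.unlink()
```

Because the temporary file is not deleted automatically, tests that call to_path() without a path may want to remove the returned file (and any index next to it) when they finish.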

The variants may also be retrieved in the order they were added via the to_unsorted_list() method, or in coordinate-sorted order via the to_sorted_list() method.
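
The difference between the two orderings can be sketched without pysam. The snippet below is an illustration only: plain tuples act as hypothetical stand-ins for records, the contig names are made up, and the key mirrors the (contig index, start, stop) sort key visible in the builder source.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a record: (contig, 0-based start, stop)
Record = Tuple[str, int, int]

# Contig order as it would appear in a sequence dictionary (assumed names)
seq_idx_lookup: Dict[str, int] = {"chr1": 0, "chr2": 1, "chr10": 2}


def sort_key(record: Record) -> Tuple[int, int, int]:
    """Orders by position of the contig in the dictionary, then start, then stop."""
    contig, start, stop = record
    return seq_idx_lookup[contig], start, stop


records: List[Record] = [
    ("chr10", 99, 100),
    ("chr2", 499, 500),
    ("chr1", 999, 1000),
    ("chr2", 99, 100),
]

unsorted_records = list(records)                # insertion order is preserved
sorted_records = sorted(records, key=sort_key)  # coordinate order
```

Note that contigs are ordered by their index in the sequence dictionary rather than alphabetically, so "chr10" sorts after "chr2" here.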

Functions

reader

reader(path: VcfPath) -> Generator[VariantFile, None, None]

Opens the given path for VCF reading.

Parameters:

Name | Type | Description | Default
path | VcfPath | the path to a VCF, or an open file handle | required
Source code in fgpyo/vcf/__init__.py
@contextmanager
def reader(path: VcfPath) -> Generator[VcfReader, None, None]:
    """Opens the given path for VCF reading

    Args:
        path: the path to a VCF, or an open file handle
    """
    if isinstance(path, (str, Path, TextIO)):
        with fgpyo.io.suppress_stderr():
            # to avoid spamming log about index older than vcf, redirect stderr to /dev/null: only
            # when first opening the file
            _reader = VariantFile(path, mode="r")  # type: ignore[arg-type]
        # now stderr is back, so any later stderr messages will go through
        yield _reader
        _reader.close()
    else:
        raise TypeError(f"Cannot open '{type(path)}' for VCF reading.")
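
The reader is a generator-based context manager. The sketch below shows the same pattern with `io.StringIO` standing in for the VCF handle, so it runs without pysam. The name `open_resource` is hypothetical, and the `try`/`finally` is an addition in this sketch: it closes the handle even if the body of the `with` block raises, which the listing above does not do explicitly.

```python
import io
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def open_resource(text: str) -> Iterator[io.StringIO]:
    """Yields an in-memory handle and always closes it afterwards."""
    handle = io.StringIO(text)
    try:
        yield handle
    finally:
        # Runs on normal exit and when the with-block raises
        handle.close()


with open_resource("##fileformat=VCFv4.2\n") as handle:
    first_line = handle.readline().rstrip("\n")

closed_after_block = handle.closed

try:
    with open_resource("") as failing:
        raise RuntimeError("simulated failure inside the block")
except RuntimeError:
    closed_after_error = failing.closed
```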

writer

writer(path: VcfPath, header: VariantHeader) -> Generator[VariantFile, None, None]

Opens the given path for VCF writing.

Parameters:

Name | Type | Description | Default
path | VcfPath | the path to a VCF, or an open file handle | required
header | VariantHeader | the source for the output VCF header. If you are modifying a VCF file that you are reading from, you can pass reader.header | required
Source code in fgpyo/vcf/__init__.py
@contextmanager
def writer(path: VcfPath, header: VariantHeader) -> Generator[VcfWriter, None, None]:
    """Opens the given path for VCF writing.

    Args:
        path: the path to a VCF, or an open filehandle
        header: the source for the output VCF header. If you are modifying a VCF file that you are
                reading from, you can pass reader.header
    """
    # Convert Path to str such that pysam will autodetect to write as a gzipped file if provided
    # with a .vcf.gz suffix.
    if isinstance(path, Path):
        path = str(path)
    _writer = VariantFile(path, header=header, mode="w")
    yield _writer
    _writer.close()

Modules

builder

Classes for generating VCF and records for testing

Classes

VariantBuilder

Builder for constructing one or more variant records (pysam.VariantRecord) for a VCF. The VCF can be sites-only, single-sample, or multi-sample.

Provides the ability to manufacture variants from minimal arguments, while generating any remaining attributes to ensure a valid variant.

A builder is constructed with a handful of defaults including the sample name and sequence dictionary. If the VCF will not be sites-only, the list of sample IDs ("sample_ids") must be provided to the VariantBuilder constructor.

Variants are then added using the add() method. Once accumulated, the variants can be accessed in the order in which they were created through the to_unsorted_list() method, or in a list sorted by coordinate order via to_sorted_list(). Lastly, the records can be written to a temporary file using to_path().

Attributes:

Name | Type | Description
sample_ids | List[str] | the sample name(s)
sd | Dict[str, Dict[str, Any]] | sequence dictionary, implemented as a Python dict from contig name to a dictionary of contig properties. At a minimum, each contig dict in sd must contain "ID" (the same as the contig name) and "length", the contig length. Other values will be added to the VCF header line for that contig.
seq_idx_lookup | Dict[str, int] | dictionary mapping contig name to index of contig in sd
records | List[VariantRecord] | the list of variant records
header | VariantHeader | the pysam header
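
To make the shape of `sd` and `seq_idx_lookup` concrete, here is a small pysam-free sketch. It follows the contig-line logic shown in the builder source, but `contig_header_lines` is a hypothetical standalone helper, and the contig names, lengths, and the extra "assembly" key are made-up illustration values.

```python
from typing import Any, Dict, Iterator


def contig_header_lines(sd: Dict[str, Dict[str, Any]]) -> Iterator[str]:
    """Yields one ##contig header line per entry of the sequence dictionary."""
    for d in sd.values():
        if "ID" not in d or "length" not in d:
            raise ValueError(
                "Sequence dictionary must include 'ID' and 'length' for each contig."
            )
        # Any additional keys are appended to the header line as key=value pairs
        extras = "".join(
            f",{key}={value}" for key, value in d.items() if key not in ("ID", "length")
        )
        yield f"##contig=<ID={d['ID']},length={d['length']}{extras}>"


# Toy sequence dictionary: contig name -> contig properties
sd: Dict[str, Dict[str, Any]] = {
    "chr1": {"ID": "chr1", "length": 1000},
    "chr2": {"ID": "chr2", "length": 500, "assembly": "toy"},
}

# Contig name -> index, following insertion order of the dictionary
seq_idx_lookup = {name: i for i, name in enumerate(sd.keys())}

lines = list(contig_header_lines(sd))
```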

Source code in fgpyo/vcf/builder.py
class VariantBuilder:
    """
    Builder for constructing one or more variant records (pysam.VariantRecord) for a VCF. The VCF
    can be sites-only, single-sample, or multi-sample.

    Provides the ability to manufacture variants from minimal arguments, while generating
    any remaining attributes to ensure a valid variant.

    A builder is constructed with a handful of defaults including the sample name and sequence
    dictionary. If the VCF will not be sites-only, the list of sample IDs ("sample_ids") must be
    provided to the VariantBuilder constructor.

    Variants are then added using the [`add()`][fgpyo.vcf.builder.VariantBuilder.add]
    method.
    Once accumulated, the variants can be accessed in the order in which they were created
    through the [`to_unsorted_list()`][fgpyo.vcf.builder.VariantBuilder.to_unsorted_list]
    method, or in a list sorted by coordinate order via
    [`to_sorted_list()`][fgpyo.vcf.builder.VariantBuilder.to_sorted_list]. Lastly, the
    records can be written to a temporary file using
    [`to_path()`][fgpyo.vcf.builder.VariantBuilder.to_path].

    Attributes:
        sample_ids: the sample name(s)
        sd: sequence dictionary, implemented as python dict from contig name to dictionary with
            contig properties. At a minimum, each contig dict in sd must contain "ID" (the same as
            contig_name) and "length", the contig length. Other values will be added to the VCF
            header line for that contig.
        seq_idx_lookup: dictionary mapping contig name to index of contig in sd
        records: the list of variant records
        header: the pysam header
    """

    sample_ids: List[str]
    sd: Dict[str, Dict[str, Any]]
    seq_idx_lookup: Dict[str, int]
    records: List[VariantRecord]
    header: VariantHeader

    def __init__(
        self,
        sample_ids: Optional[Iterable[str]] = None,
        sd: Optional[Dict[str, Dict[str, Any]]] = None,
    ) -> None:
        """Initializes a new VariantBuilder for generating variants and VCF files.

        Args:
            sample_ids: the name of the sample(s)
            sd: optional sequence dictionary
        """
        self.sample_ids: List[str] = list(sample_ids) if sample_ids is not None else []
        self.sd: Dict[str, Dict[str, Any]] = sd if sd is not None else VariantBuilder.default_sd()
        self.seq_idx_lookup: Dict[str, int] = {name: i for i, name in enumerate(self.sd.keys())}
        self.records: List[VariantRecord] = []
        self.header = VariantHeader()
        for line in VariantBuilder._build_header_string(sd=self.sd):
            self.header.add_line(line)
        if sample_ids is not None:
            self.header.add_samples(sample_ids)

    @classmethod
    def default_sd(cls) -> Dict[str, Dict[str, Any]]:
        """Generates the sequence dictionary that is used by default by VariantBuilder.
        Re-uses the dictionary from SamBuilder for consistency.

        Returns:
            A new copy of the sequence dictionary as a map of contig name to dictionary, one per
            contig.
        """
        sd: Dict[str, Dict[str, Any]] = {}
        for sequence in SamBuilder.default_sd():
            contig = sequence["SN"]
            sd[contig] = {"ID": contig, "length": sequence["LN"]}
        return sd

    @classmethod
    def _build_header_string(cls, sd: Optional[Dict[str, Dict[str, Any]]] = None) -> Iterator[str]:
        """Builds the VCF header lines for the given sequence dictionary.

        Args:
            sd: the sequence dictionary mapping the contig name to the key-value pairs for the
                given contig.  Must include "ID" and "length" for each contig.  If no sequence
                dictionary is given, will use the default dictionary.
        """
        if sd is None:
            sd = VariantBuilder.default_sd()
        # add mandatory VCF format
        yield "##fileformat=VCFv4.2"
        # add GT
        yield '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">'
        # add additional common INFO lines
        yield '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">'
        yield (
            '##INFO=<ID=AR,Number=A,Type=Float,Description="Allele Ratio - ratio of AD for allele'
            ' vs. AD for modal allele.">'
        )
        yield '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">'
        # add additional common FORMAT lines
        yield (
            '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt'
            ' alleles in the order listed">'
        )
        yield '##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">'
        yield '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total Depth">'

        for d in sd.values():
            if "ID" not in d or "length" not in d:
                raise ValueError(
                    "Sequence dictionary must include 'ID' and 'length' for each contig."
                )
            contig_id = d["ID"]
            contig_length = d["length"]
            contig_header = f"##contig=<ID={contig_id},length={contig_length}"
            for key, value in d.items():
                if key == "ID" or key == "length":
                    continue
                contig_header += f",{key}={value}"
            contig_header += ">"
            yield contig_header

    @property
    def num_samples(self) -> int:
        return len(self.sample_ids)

    def add(
        self,
        contig: Optional[str] = None,
        pos: int = 1000,
        end: Optional[int] = None,
        id: str = ".",
        ref: str = "A",
        alts: Union[None, str, Iterable[str]] = (".",),
        qual: int = 60,
        filter: Union[None, str, Iterable[str]] = None,
        info: Optional[Dict[str, Any]] = None,
        samples: Optional[Dict[str, Dict[str, Any]]] = None,
    ) -> VariantRecord:
        """Generates a new variant and adds it to the internal collection.

        Notes:
        * Very little validation is done with respect to INFO and FORMAT keys being defined in the
        header.
        * VCFs are 1-based, but pysam is (mostly) 0-based. We define the function in terms of the
        VCF property "pos", which is 1-based. pysam will also report "pos" as 1-based, so that is
        the property that should be accessed when using the records produced by this function (not
        "start").

        Args:
            contig: the chromosome name. If None, will use the first contig in the sequence
                    dictionary.
            pos: the 1-based position of the variant
            end: an optional 1-based inclusive END position; if not specified a value will be looked
                 for in info["END"], or calculated from the length of the reference allele
            id: the variant id
            ref: the reference allele
            alts: the list of alternate alleles, None if no alternates. If a single string is
                  passed, that will be used as the only alt.
            qual: the variant quality
            filter: the list of filters, None if no filters (ex. PASS). If a single string is
                    passed, that will be used as the only filter.
            info: the dictionary of INFO key-value pairs
            samples: the dictionary from sample name to FORMAT key-value pairs.
                     if a sample property is supplied for any sample but omitted in some, it will
                     be set to missing (".") for samples that don't have that property explicitly
                     assigned. If a sample in the VCF is omitted, all its properties will be set to
                     missing.
        """
        if contig is None:
            contig = next(iter(self.sd.keys()))

        if contig not in self.sd:
            raise ValueError(f"Chromosome `{contig}` not in the sequence dictionary.")
        # because there are a lot of slightly different objects related to samples or called
        # "samples" in this function, we alias samples to sample_formats
        # we still want to keep the API labeled "samples" because that keeps the naming scheme the
        # same as the pysam API
        sample_formats = samples
        if sample_formats is not None:
            unknown_samples = set(sample_formats.keys()).difference(self.sample_ids)
            if len(unknown_samples) > 0:
                raise ValueError("Unknown sample(s) given: " + ", ".join(unknown_samples))

        if isinstance(alts, str):
            alts = (alts,)
        alleles = (ref,) if alts is None else (ref, *alts)
        if isinstance(filter, str):
            filter = (filter,)

        # pysam expects a list of format dicts provided in the same order as the samples in the
        # header (self.sample_ids). (This is despite the fact that it will internally represent the
        # values as a map from sample ID to format values, as we do in this function.)
        # Convert to that form and rename to record_samples; to a) disambiguate from the input
        # values, and b) prevent mypy from complaining about the type changing from dict to list.
        if self.num_samples == 0:
            # this is a sites-only VCF
            record_samples = None
        elif sample_formats is None or len(sample_formats) == 0:
            # not a sites-only VCF, but no FORMAT values were passed. set FORMAT to missing (with
            # no fields)
            record_samples = None
        else:
            # convert to list form that pysam expects, in order pysam expects
            # note: the copy {**format_dict} below is present because pysam actually alters the
            # input values, which would be an unintended side-effect (in fact without this, tests
            # fail because the expected input values are changed)
            record_samples = [
                {**sample_formats.get(sample_id, {})} for sample_id in self.sample_ids
            ]

        variant = self.header.new_record(
            contig=contig,
            start=pos - 1,  # start is 0-based
            stop=self._compute_and_check_end(pos, ref, end, info),
            id=id,
            alleles=alleles,
            qual=qual,
            filter=filter,
            info=info,
            samples=record_samples,
        )

        self.records.append(variant)
        return variant

    def _compute_and_check_end(
        self, pos: int, ref: str, end: Optional[int], info: Optional[dict[str, Any]]
    ) -> int:
        """
        Derives the END/stop position for a new record based on the optionally provided `end`
        parameter, the presence/absence of END in the info dictionary and/or the length of the
        reference allele.

        Also checks that any given or calculated end position is greater than or equal to the
        record's position.

        Args:
            pos: the 1-based position of the record
            ref: the reference allele of the record
            end: the provided 1-based end position if one was given
            info: the info dictionary if one was given
        """
        if end is not None and info is not None and "END" in info:
            raise ValueError(f"Two end positions given; end={end} and info.END={info['END']}")
        elif end is None:
            if info is not None and "END" in info:
                end = int(info["END"])
            else:
                end = pos + len(ref) - 1

        if end < pos:
            raise ValueError(f"Invalid end position, {end}, given for variant as pos {pos}.")

        return end

    def to_path(self, path: Optional[Path] = None) -> Path:
        """
        Returns a path to a VCF for variants added to this builder.

        If the path given ends in ".gz" then the generated file will be bgzipped and
        a tabix index generated for the file with the suffix ".gz.tbi".

        Args:
            path: optional path to the VCF
        """
        # update the path
        path = self._to_vcf_path(path)

        # Create a writer and write to it
        with PysamWriter(path, header=self.header) as writer:
            for variant in self.to_sorted_list():
                writer.write(variant)

        if str(path.suffix) == ".gz":
            pysam.tabix_index(str(path), preset="vcf", force=True)

        return path

    @staticmethod
    def _to_vcf_path(path: Optional[Path]) -> Path:
        """Gets the path to a VCF file.  If path is `None`, then a temporary VCF will be created.
        Otherwise, the given path is simply returned.

        Args:
            path: optionally the path to the VCF.
        """
        if path is None:
            with NamedTemporaryFile(suffix=".vcf.gz", delete=False) as fp:
                path = Path(fp.name)
            assert path.is_file()
        return path

    def to_unsorted_list(self) -> List[VariantRecord]:
        """Returns the accumulated records in the order they were created."""
        return list(self.records)

    def to_sorted_list(self) -> List[VariantRecord]:
        """Returns the accumulated records in coordinate order."""
        return sorted(self.records, key=self._sort_key)

    def _sort_key(self, variant: VariantRecord) -> Tuple[int, int, int]:
        return self.seq_idx_lookup[variant.contig], variant.start, variant.stop

    def add_header_line(self, line: str) -> None:
        """Adds a header line to the header"""
        self.header.add_line(line)

    def add_info_header(
        self,
        name: str,
        field_type: VcfFieldType,
        number: Union[int, VcfFieldNumber] = 1,
        description: Optional[str] = None,
        source: Optional[str] = None,
        version: Optional[str] = None,
    ) -> None:
        """Add an INFO header field to the VCF header.

        Args:
            name: the name of the field
            field_type: the field_type of the field
            number: the number of the field
            description: the description of the field
            source: the source of the field
            version: the version of the field
        """
        if field_type == VcfFieldType.FLAG:
            num = "0"  # FLAGs always have number = 0
        elif isinstance(number, VcfFieldNumber):
            num = number.value
        else:
            num = str(number)

        header_line = f"##INFO=<ID={name},Number={num},Type={field_type.value}"
        if description is not None:
            header_line += f",Description={description}"
        if source is not None:
            header_line += f",Source={source}"
        if version is not None:
            header_line += f",Version={version}"
        header_line += ">"
        self.add_header_line(header_line)

    def add_format_header(
        self,
        name: str,
        field_type: VcfFieldType,
        number: Union[int, VcfFieldNumber] = VcfFieldNumber.NUM_GENOTYPES,
        description: Optional[str] = None,
    ) -> None:
        """
        Add a FORMAT header field to the VCF header.

        Args:
            name: the name of the field
            field_type: the field_type of the field
            number: the number of the field
            description: the description of the field
        """
        if isinstance(number, VcfFieldNumber):
            num = number.value
        else:
            num = str(number)

        header_line = f"##FORMAT=<ID={name},Number={num},Type={field_type.value}"
        if description is not None:
            header_line += f",Description={description}"
        header_line += ">"
        self.add_header_line(header_line)

    def add_filter_header(
        self,
        name: str,
        description: Optional[str] = None,
    ) -> None:
        """
        Add a FILTER header field to the VCF header.

        Args:
            name: the name of the field
            description: the description of the field
        """
        header_line = f"##FILTER=<ID={name}"
        if description is not None:
            header_line += f",Description={description}"
        header_line += ">"
        self.add_header_line(header_line)
Functions
__init__
__init__(sample_ids: Optional[Iterable[str]] = None, sd: Optional[Dict[str, Dict[str, Any]]] = None) -> None

Initializes a new VariantBuilder for generating variants and VCF files.

Parameters:

Name | Type | Description | Default
sample_ids | Optional[Iterable[str]] | the name of the sample(s) | None
sd | Optional[Dict[str, Dict[str, Any]]] | optional sequence dictionary | None
Source code in fgpyo/vcf/builder.py
def __init__(
    self,
    sample_ids: Optional[Iterable[str]] = None,
    sd: Optional[Dict[str, Dict[str, Any]]] = None,
) -> None:
    """Initializes a new VariantBuilder for generating variants and VCF files.

    Args:
        sample_ids: the name of the sample(s)
        sd: optional sequence dictionary
    """
    self.sample_ids: List[str] = list(sample_ids) if sample_ids is not None else []
    self.sd: Dict[str, Dict[str, Any]] = sd if sd is not None else VariantBuilder.default_sd()
    self.seq_idx_lookup: Dict[str, int] = {name: i for i, name in enumerate(self.sd.keys())}
    self.records: List[VariantRecord] = []
    self.header = VariantHeader()
    for line in VariantBuilder._build_header_string(sd=self.sd):
        self.header.add_line(line)
    if sample_ids is not None:
        self.header.add_samples(sample_ids)
add
add(contig: Optional[str] = None, pos: int = 1000, end: Optional[int] = None, id: str = '.', ref: str = 'A', alts: Union[None, str, Iterable[str]] = ('.',), qual: int = 60, filter: Union[None, str, Iterable[str]] = None, info: Optional[Dict[str, Any]] = None, samples: Optional[Dict[str, Dict[str, Any]]] = None) -> VariantRecord

Generates a new variant and adds it to the internal collection.

Notes:

  • Very little validation is done with respect to INFO and FORMAT keys being defined in the header.
  • VCFs are 1-based, but pysam is (mostly) 0-based. We define the function in terms of the VCF property "pos", which is 1-based. pysam will also report "pos" as 1-based, so that is the property that should be accessed when using the records produced by this function (not "start").
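
The coordinate handling described in these notes can be sketched in plain Python. The function below is a hypothetical standalone helper (`to_start_stop` is not part of fgpyo) that combines the `pos - 1` conversion with the end-position rules shown in the builder source: an explicit `end` wins, then `info["END"]`, then the length of the reference allele.

```python
from typing import Any, Dict, Optional, Tuple


def to_start_stop(
    pos: int,
    ref: str,
    end: Optional[int] = None,
    info: Optional[Dict[str, Any]] = None,
) -> Tuple[int, int]:
    """Converts a 1-based pos and optional END into a 0-based (start, stop) pair."""
    if end is not None and info is not None and "END" in info:
        raise ValueError(f"Two end positions given; end={end} and info.END={info['END']}")
    if end is None:
        if info is not None and "END" in info:
            end = int(info["END"])
        else:
            # The last reference base covered by the REF allele (1-based, inclusive)
            end = pos + len(ref) - 1
    if end < pos:
        raise ValueError(f"Invalid end position, {end}, for variant at pos {pos}.")
    # A 1-based inclusive end has the same value as a 0-based exclusive stop
    return pos - 1, end


snv = to_start_stop(1000, "A")
deletion = to_start_stop(1000, "ACGT")
symbolic = to_start_stop(1000, "A", info={"END": 1500})
```

Here `snv` covers a single base, `deletion` spans four reference bases, and `symbolic` takes its end from the INFO dictionary.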

Parameters:

Name | Type | Description | Default
contig | Optional[str] | the chromosome name. If None, will use the first contig in the sequence dictionary. | None
pos | int | the 1-based position of the variant | 1000
end | Optional[int] | an optional 1-based inclusive END position; if not specified, a value will be looked for in info["END"], or calculated from the length of the reference allele | None
id | str | the variant id | '.'
ref | str | the reference allele | 'A'
alts | Union[None, str, Iterable[str]] | the list of alternate alleles, None if no alternates. If a single string is passed, that will be used as the only alt. | ('.',)
qual | int | the variant quality | 60
filter | Union[None, str, Iterable[str]] | the list of filters, None if no filters (e.g. PASS). If a single string is passed, that will be used as the only filter. | None
info | Optional[Dict[str, Any]] | the dictionary of INFO key-value pairs | None
samples | Optional[Dict[str, Dict[str, Any]]] | the dictionary from sample name to FORMAT key-value pairs. If a sample property is supplied for any sample but omitted in some, it will be set to missing (".") for samples that don't have that property explicitly assigned. If a sample in the VCF is omitted, all its properties will be set to missing. | None
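
How the `samples` argument is prepared before the record is created can be sketched without pysam. The helper below is hypothetical (`to_record_samples` is not an fgpyo function) and follows the logic in the source listing: unknown sample names are rejected, and the per-sample dictionaries are copied into a list ordered like `sample_ids`, with an empty dictionary for any sample that was omitted.

```python
from typing import Any, Dict, List, Optional


def to_record_samples(
    sample_ids: List[str],
    samples: Optional[Dict[str, Dict[str, Any]]],
) -> Optional[List[Dict[str, Any]]]:
    """Converts a sample-name -> FORMAT mapping into a list ordered like sample_ids."""
    if samples:
        unknown = set(samples).difference(sample_ids)
        if unknown:
            raise ValueError("Unknown sample(s) given: " + ", ".join(sorted(unknown)))
    if not sample_ids or not samples:
        # Sites-only builder, or no FORMAT values supplied
        return None
    # Copy each dict so later mutation of the list does not touch the caller's input
    return [dict(samples.get(sample_id, {})) for sample_id in sample_ids]


samples_in = {"sample2": {"GT": "0|1"}}
record_samples = to_record_samples(["sample1", "sample2"], samples_in)
sites_only = to_record_samples([], None)
```

In this sketch "sample1" was omitted from the input, so it receives an empty dictionary, which corresponds to all of its FORMAT values being missing.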
Source code in fgpyo/vcf/builder.py
def add(
    self,
    contig: Optional[str] = None,
    pos: int = 1000,
    end: Optional[int] = None,
    id: str = ".",
    ref: str = "A",
    alts: Union[None, str, Iterable[str]] = (".",),
    qual: int = 60,
    filter: Union[None, str, Iterable[str]] = None,
    info: Optional[Dict[str, Any]] = None,
    samples: Optional[Dict[str, Dict[str, Any]]] = None,
) -> VariantRecord:
    """Generates a new variant and adds it to the internal collection.

    Notes:
    * Very little validation is done with respect to INFO and FORMAT keys being defined in the
    header.
    * VCFs are 1-based, but pysam is (mostly) 0-based. We define the function in terms of the
    VCF property "pos", which is 1-based. pysam will also report "pos" as 1-based, so that is
    the property that should be accessed when using the records produced by this function (not
    "start").

    Args:
        contig: the chromosome name. If None, will use the first contig in the sequence
                dictionary.
        pos: the 1-based position of the variant
        end: an optional 1-based inclusive END position; if not specified a value will be looked
             for in info["END"], or calculated from the length of the reference allele
        id: the variant id
        ref: the reference allele
        alts: the list of alternate alleles, None if no alternates. If a single string is
              passed, that will be used as the only alt.
        qual: the variant quality
        filter: the list of filters, None if no filters (ex. PASS). If a single string is
                passed, that will be used as the only filter.
        info: the dictionary of INFO key-value pairs
        samples: the dictionary from sample name to FORMAT key-value pairs.
                 if a sample property is supplied for any sample but omitted in some, it will
                 be set to missing (".") for samples that don't have that property explicitly
                 assigned. If a sample in the VCF is omitted, all its properties will be set to
                 missing.
    """
    if contig is None:
        contig = next(iter(self.sd.keys()))

    if contig not in self.sd:
        raise ValueError(f"Chromosome `{contig}` not in the sequence dictionary.")
    # because there are a lot of slightly different objects related to samples or called
    # "samples" in this function, we alias samples to sample_formats
    # we still want to keep the API labeled "samples" because that keeps the naming scheme the
    # same as the pysam API
    sample_formats = samples
    if sample_formats is not None:
        unknown_samples = set(sample_formats.keys()).difference(self.sample_ids)
        if len(unknown_samples) > 0:
            raise ValueError("Unknown sample(s) given: " + ", ".join(unknown_samples))

    if isinstance(alts, str):
        alts = (alts,)
    alleles = (ref,) if alts is None else (ref, *alts)
    if isinstance(filter, str):
        filter = (filter,)

    # pysam expects a list of format dicts provided in the same order as the samples in the
    # header (self.sample_ids). (This is despite the fact that it will internally represent the
    # values as a map from sample ID to format values, as we do in this function.)
    # Convert to that form and rename to record_samples; to a) disambiguate from the input
    # values, and b) prevent mypy from complaining about the type changing from dict to list.
    if self.num_samples == 0:
        # this is a sites-only VCF
        record_samples = None
    elif sample_formats is None or len(sample_formats) == 0:
        # not a sites-only VCF, but no FORMAT values were passed. set FORMAT to missing (with
        # no fields)
        record_samples = None
    else:
        # convert to list form that pysam expects, in order pysam expects
        # note: the copy {**format_dict} below is present because pysam actually alters the
        # input values, which would be an unintended side-effect (in fact without this, tests
        # fail because the expected input values are changed)
        record_samples = [
            {**sample_formats.get(sample_id, {})} for sample_id in self.sample_ids
        ]

    variant = self.header.new_record(
        contig=contig,
        start=pos - 1,  # start is 0-based
        stop=self._compute_and_check_end(pos, ref, end, info),
        id=id,
        alleles=alleles,
        qual=qual,
        filter=filter,
        info=info,
        samples=record_samples,
    )

    self.records.append(variant)
    return variant
add_filter_header
add_filter_header(name: str, description: Optional[str] = None) -> None

Add a FILTER header field to the VCF header.

Parameters:

Name | Type | Description | Default
name | str | the name of the field | required
description | Optional[str] | the description of the field | None
Source code in fgpyo/vcf/builder.py
def add_filter_header(
    self,
    name: str,
    description: Optional[str] = None,
) -> None:
    """
    Add a FILTER header field to the VCF header.

    Args:
        name: the name of the field
        description: the description of the field
    """
    header_line = f"##FILTER=<ID={name}"
    if description is not None:
        header_line += f",Description={description}"
    header_line += ">"
    self.add_header_line(header_line)
add_format_header
add_format_header(name: str, field_type: VcfFieldType, number: Union[int, VcfFieldNumber] = NUM_GENOTYPES, description: Optional[str] = None) -> None

Add a FORMAT header field to the VCF header.

Parameters:

Name | Type | Description | Default
name | str | the name of the field | required
field_type | VcfFieldType | the field_type of the field | required
number | Union[int, VcfFieldNumber] | the number of the field | NUM_GENOTYPES
description | Optional[str] | the description of the field | None
Source code in fgpyo/vcf/builder.py
def add_format_header(
    self,
    name: str,
    field_type: VcfFieldType,
    number: Union[int, VcfFieldNumber] = VcfFieldNumber.NUM_GENOTYPES,
    description: Optional[str] = None,
) -> None:
    """
    Add a FORMAT header field to the VCF header.

    Args:
        name: the name of the field
        field_type: the field_type of the field
        number: the number of the field
        description: the description of the field
    """
    if isinstance(number, VcfFieldNumber):
        num = number.value
    else:
        num = str(number)

    header_line = f"##FORMAT=<ID={name},Number={num},Type={field_type.value}"
    if description is not None:
        header_line += f",Description={description}"
    header_line += ">"
    self.add_header_line(header_line)
add_header_line
add_header_line(line: str) -> None

Adds a header line to the header

Source code in fgpyo/vcf/builder.py
def add_header_line(self, line: str) -> None:
    """Adds a header line to the header"""
    self.header.add_line(line)
add_info_header
add_info_header(name: str, field_type: VcfFieldType, number: Union[int, VcfFieldNumber] = 1, description: Optional[str] = None, source: Optional[str] = None, version: Optional[str] = None) -> None

Add an INFO header field to the VCF header.

Parameters:

Name Type Description Default
name str

the name of the field

required
field_type VcfFieldType

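the value type of the field, as a VcfFieldType code; FLAG fields always get Number=0

required
number Union[int, VcfFieldNumber]

the number of values the field holds, as an integer or a VcfFieldNumber code

1
description Optional[str]

the description of the field

None
source Optional[str]

the value of the optional Source key of the header line

None
version Optional[str]

the value of the optional Version key of the header line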

None
Source code in fgpyo/vcf/builder.py
def add_info_header(
    self,
    name: str,
    field_type: VcfFieldType,
    number: Union[int, VcfFieldNumber] = 1,
    description: Optional[str] = None,
    source: Optional[str] = None,
    version: Optional[str] = None,
) -> None:
    """Add an INFO header field to the VCF header.

    Args:
        name: the name of the field
        field_type: the field_type of the field
        number: the number of the field
        description: the description of the field
        source: the source of the field
        version: the version of the field
    """
    if field_type == VcfFieldType.FLAG:
        num = "0"  # FLAGs always have number = 0
    elif isinstance(number, VcfFieldNumber):
        num = number.value
    else:
        num = str(number)

    header_line = f"##INFO=<ID={name},Number={num},Type={field_type.value}"
    if description is not None:
        header_line += f",Description={description}"
    if source is not None:
        header_line += f",Source={source}"
    if version is not None:
        header_line += f",Version={version}"
    header_line += ">"
    self.add_header_line(header_line)
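The number resolution in the listing above has three branches, and the FLAG branch takes priority over whatever `number` was passed. A minimal sketch of just that branching, assuming a hypothetical helper named `resolve_info_number` and trimmed local copies of the enums:

```python
from enum import Enum
from typing import Union


class VcfFieldType(Enum):
    """Trimmed local copy; only the members used below."""
    INTEGER = "Integer"
    FLAG = "Flag"


class VcfFieldNumber(Enum):
    """Trimmed local copy; only the members used below."""
    NUM_ALT_ALLELES = "A"
    UNKNOWN = "."


def resolve_info_number(  # hypothetical helper, not an fgpyo function
    field_type: VcfFieldType,
    number: Union[int, VcfFieldNumber] = 1,
) -> str:
    if field_type == VcfFieldType.FLAG:
        return "0"  # flags carry no values, so the requested number is ignored
    if isinstance(number, VcfFieldNumber):
        return number.value
    return str(number)


flag_number = resolve_info_number(VcfFieldType.FLAG, number=5)
per_alt_number = resolve_info_number(VcfFieldType.INTEGER, VcfFieldNumber.NUM_ALT_ALLELES)
default_number = resolve_info_number(VcfFieldType.INTEGER)
print(flag_number, per_alt_number, default_number)  # 0 A 1
```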
default_sd classmethod
default_sd() -> Dict[str, Dict[str, Any]]

Generates the sequence dictionary that is used by default by VariantBuilder. Re-uses the dictionary from SamBuilder for consistency.

Returns:

Type Description
Dict[str, Dict[str, Any]]

A new copy of the sequence dictionary as a map of contig name to dictionary, one per contig.

Source code in fgpyo/vcf/builder.py
@classmethod
def default_sd(cls) -> Dict[str, Dict[str, Any]]:
    """Generates the sequence dictionary that is used by default by VariantBuilder.
    Re-uses the dictionary from SamBuilder for consistency.

    Returns:
        A new copy of the sequence dictionary as a map of contig name to dictionary, one per
        contig.
    """
    sd: Dict[str, Dict[str, Any]] = {}
    for sequence in SamBuilder.default_sd():
        contig = sequence["SN"]
        sd[contig] = {"ID": contig, "length": sequence["LN"]}
    return sd
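The conversion above reshapes a list of SAM-style sequence entries (keys "SN" and "LN") into a map keyed by contig name. A minimal sketch of the same reshaping, using made-up contig names and lengths in place of the real `SamBuilder.default_sd()` output, which is not reproduced here:

```python
from typing import Any, Dict, List

# Hypothetical input: the contig names and lengths are invented for illustration
sam_sequences: List[Dict[str, Any]] = [
    {"SN": "chr1", "LN": 1000},
    {"SN": "chr2", "LN": 500},
]


def to_contig_map(sequences: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Builds one {"ID": ..., "length": ...} dictionary per contig."""
    sd: Dict[str, Dict[str, Any]] = {}
    for sequence in sequences:
        contig = sequence["SN"]
        sd[contig] = {"ID": contig, "length": sequence["LN"]}
    return sd


contig_map = to_contig_map(sam_sequences)
print(contig_map["chr2"])  # {'ID': 'chr2', 'length': 500}
```

Because Python dictionaries preserve insertion order, the contigs in the resulting map keep the order of the input list.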
to_path
to_path(path: Optional[Path] = None) -> Path

Returns a path to a VCF for variants added to this builder.

If the given path ends in ".gz", the generated file is bgzipped and a tabix index is generated for it, with the suffix ".gz.tbi".

Parameters:

Name Type Description Default
path Optional[Path]

optional path to the VCF

None
Source code in fgpyo/vcf/builder.py
def to_path(self, path: Optional[Path] = None) -> Path:
    """
    Returns a path to a VCF for variants added to this builder.

    If the path given ends in ".gz" then the generated file will be bgzipped and
    a tabix index generated for the file with the suffix ".gz.tbi".

    Args:
        path: optional path to the VCF
    """
    # update the path
    path = self._to_vcf_path(path)

    # Create a writer and write to it
    with PysamWriter(path, header=self.header) as writer:
        for variant in self.to_sorted_list():
            writer.write(variant)

    if str(path.suffix) == ".gz":
        pysam.tabix_index(str(path), preset="vcf", force=True)

    return path
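The indexing decision above depends only on the final suffix of the path, as reported by `pathlib`. A small sketch of that check, with two hypothetical helpers (`needs_tabix_index` and `tabix_index_path`) that are not part of fgpyo; the file names are invented:

```python
from pathlib import Path


def needs_tabix_index(path: Path) -> bool:  # hypothetical helper
    # Path.suffix returns only the last extension, so "calls.vcf.gz" gives ".gz"
    return path.suffix == ".gz"


def tabix_index_path(path: Path) -> Path:  # hypothetical helper
    # The index sits next to the VCF, with ".tbi" appended to the file name
    return path.with_name(path.name + ".tbi")


compressed = Path("calls.vcf.gz")
plain = Path("calls.vcf")
print(needs_tabix_index(compressed))  # True
print(needs_tabix_index(plain))       # False
print(tabix_index_path(compressed))   # calls.vcf.gz.tbi
```

Under this check, a path with a different compressed-style extension such as ".bgz" would not trigger indexing.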
to_sorted_list
to_sorted_list() -> List[VariantRecord]

Returns the accumulated records in coordinate order.

Source code in fgpyo/vcf/builder.py
def to_sorted_list(self) -> List[VariantRecord]:
    """Returns the accumulated records in coordinate order."""
    return sorted(self.records, key=self._sort_key)
to_unsorted_list
to_unsorted_list() -> List[VariantRecord]

Returns the accumulated records in the order they were created.

Source code in fgpyo/vcf/builder.py
def to_unsorted_list(self) -> List[VariantRecord]:
    """Returns the accumulated records in the order they were created."""
    return list(self.records)
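The difference between the two list methods can be illustrated without pysam. The builder's `_sort_key` is not shown in this section, so the sketch below assumes a key of (contig position in the header, site position); `Site`, `contig_order`, and `sort_key` are hypothetical stand-ins.

```python
from typing import Dict, List, NamedTuple, Tuple


class Site(NamedTuple):
    """Hypothetical stand-in for a variant record."""
    contig: str
    pos: int


# Assumed contig ordering, as it might appear in a header (not alphabetical)
contig_order: Dict[str, int] = {"chr1": 0, "chr2": 1, "chr10": 2}


def sort_key(site: Site) -> Tuple[int, int]:
    # Assumption: order by contig index first, then by position
    return (contig_order[site.contig], site.pos)


records: List[Site] = [Site("chr10", 5), Site("chr2", 300), Site("chr1", 42), Site("chr2", 7)]

unsorted_copy = list(records)                # insertion order, as a new list
sorted_copy = sorted(records, key=sort_key)  # coordinate order, also a new list

print([(s.contig, s.pos) for s in sorted_copy])
# [('chr1', 42), ('chr2', 7), ('chr2', 300), ('chr10', 5)]
```

Both calls return new lists, so mutating either result leaves the stored records untouched.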
VcfFieldNumber

Bases: Enum

Special codes for VCF field numbers

Source code in fgpyo/vcf/builder.py
class VcfFieldNumber(Enum):
    """Special codes for VCF field numbers"""

    NUM_ALT_ALLELES = "A"
    NUM_ALLELES = "R"
    NUM_GENOTYPES = "G"
    UNKNOWN = "."
VcfFieldType

Bases: Enum

Codes for VCF field types

Source code in fgpyo/vcf/builder.py
class VcfFieldType(Enum):
    """Codes for VCF field types"""

    INTEGER = "Integer"
    FLOAT = "Float"
    FLAG = "Flag"
    CHARACTER = "Character"
    STRING = "String"
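Both enums map a readable member name to the code written into the header, so standard `Enum` behaviour applies in either direction. A short sketch, with the two classes redeclared locally so it runs without fgpyo installed:

```python
from enum import Enum


class VcfFieldNumber(Enum):
    """Local copy of the special field number codes."""
    NUM_ALT_ALLELES = "A"
    NUM_ALLELES = "R"
    NUM_GENOTYPES = "G"
    UNKNOWN = "."


class VcfFieldType(Enum):
    """Local copy of the field type codes."""
    INTEGER = "Integer"
    FLOAT = "Float"
    FLAG = "Flag"
    CHARACTER = "Character"
    STRING = "String"


# Member to header code: use .value
header_fragment = (
    f"Number={VcfFieldNumber.NUM_ALT_ALLELES.value},Type={VcfFieldType.INTEGER.value}"
)

# Header code to member: call the enum with the code
number_code = VcfFieldNumber("R")
type_code = VcfFieldType("Float")

print(header_fragment)   # Number=A,Type=Integer
print(number_code.name)  # NUM_ALLELES
print(type_code.name)    # FLOAT
```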

Functions