
fgpyo

Classes

RequirementError

Bases: Exception

Exception raised when a requirement is not satisfied.

Source code in fgpyo/_requirements.py
class RequirementError(Exception):
    """Exception raised when a requirement is not satisfied."""

Functions

require

require(condition: bool, message: Union[str, Callable[[], str], None] = None) -> None

Require a condition be satisfied.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| condition | bool | The condition to satisfy. | required |
| message | Union[str, Callable[[], str], None] | An optional message to include with the error when the condition is false. The message may be provided as either a string literal or a function returning a string. The function will not be evaluated unless the condition is false. | None |

Raises:

| Type | Description |
| --- | --- |
| RequirementError | If the condition is false. |

Source code in fgpyo/_requirements.py
def require(condition: bool, message: Union[str, Callable[[], str], None] = None) -> None:
    """Require a condition be satisfied.

    Args:
        condition: The condition to satisfy.
        message: An optional message to include with the error when the condition is false.
            The message may be provided as either a string literal or a function returning a string.
            The function will not be evaluated unless the condition is false.

    Raises:
        RequirementError: If the condition is false.
    """
    if not condition:
        if message is None:
            raise RequirementError()
        elif isinstance(message, str):
            raise RequirementError(message)
        else:
            raise RequirementError(message())
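
A minimal usage sketch (assuming require is importable from fgpyo._requirements, per the source location shown above); note that a callable message is only evaluated when the condition fails:

>>> from fgpyo._requirements import require
>>> require(1 + 1 == 2)
>>> require(True, lambda: "never evaluated")
>>> require(2 + 2 == 5, "arithmetic is broken")
Traceback (most recent call last):
    ...
fgpyo._requirements.RequirementError: arithmetic is broken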

Modules

collections

Custom Collections and Collection Functions

This module contains classes and functions for working with collections and iterators.

Helpful Functions for Working with Collections

To test if an iterable is sorted or not:

>>> from fgpyo.collections import is_sorted
>>> is_sorted([])
True
>>> is_sorted([1])
True
>>> is_sorted([1, 2, 2, 3])
True
>>> is_sorted([1, 2, 4, 3])
False
Examples of a "Peekable" Iterator

"Peekable" iterators are useful to "peek" at the next item in an iterator without consuming it. For example, this is useful when consuming items from an iterator while a predicate is true, without consuming the first element for which the predicate is false. See the takewhile() and dropwhile() methods.

An empty peekable iterator throws a StopIteration:

>>> from fgpyo.collections import PeekableIterator
>>> piter = PeekableIterator(iter([]))
>>> piter.peek()
Traceback (most recent call last):
    ...
StopIteration

A peekable iterator will return the next item before consuming it.

>>> piter = PeekableIterator([1, 2, 3])
>>> piter.peek()
1
>>> next(piter)
1
>>> [j for j in piter]
[2, 3]

The can_peek() function can be used to determine whether the iterator can be peeked at without a StopIteration being raised:

>>> piter = PeekableIterator([1])
>>> piter.peek() if piter.can_peek() else -1
1
>>> next(piter)
1
>>> piter.peek() if piter.can_peek() else -1
-1
>>> next(piter)
Traceback (most recent call last):
    ...
StopIteration

PeekableIterator's constructor supports creation from iterable objects as well as iterators.
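
For example, takewhile() and dropwhile() can split a run of matching items from the remainder (a small sketch using the methods documented below):

>>> piter = PeekableIterator([1, 2, 3, 4, 5])
>>> piter.takewhile(lambda x: x < 3)
[1, 2]
>>> piter.dropwhile(lambda x: x == 3).takewhile(lambda x: x < 10)
[4, 5]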

Attributes

LessThanOrEqualType module-attribute
LessThanOrEqualType = TypeVar('LessThanOrEqualType', bound=SupportsLessThanOrEqual)

A type variable for an object that supports less-than-or-equal comparisons.

Classes

PeekableIterator

Bases: Generic[IterType], Iterator[IterType]

A peekable iterator wrapping an iterator or iterable.

This allows returning the next item without consuming it.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| source | Union[Iterator[IterType], Iterable[IterType]] | an iterator over the objects | required |
Source code in fgpyo/collections/__init__.py
class PeekableIterator(Generic[IterType], Iterator[IterType]):
    """A peekable iterator wrapping an iterator or iterable.

    This allows returning the next item without consuming it.

    Args:
        source: an iterator over the objects
    """

    def __init__(self, source: Union[Iterator[IterType], Iterable[IterType]]) -> None:
        self._iter: Iterator[IterType] = iter(source)
        self._sentinel: Any = object()
        self.__update_peek()

    def __iter__(self) -> Iterator[IterType]:
        return self

    def __next__(self) -> IterType:
        to_return = self.peek()
        self.__update_peek()
        return to_return

    def __update_peek(self) -> None:
        self._peek = next(self._iter, self._sentinel)

    def can_peek(self) -> bool:
        """Returns true if there is a value that can be peeked at, false otherwise."""
        return self._peek is not self._sentinel

    def peek(self) -> IterType:
        """Returns the next element without consuming it, raising StopIteration if exhausted."""
        if self.can_peek():
            return self._peek
        else:
            raise StopIteration

    def takewhile(self, pred: Callable[[IterType], bool]) -> List[IterType]:
        """Consumes from the iterator while pred is true, and returns the result as a List.

        The iterator is left pointing at the first non-matching item, or if all items match
        then the iterator will be exhausted.

        Args:
            pred: a function that takes the next value from the iterator and returns
                  true or false.

        Returns:
            List[V]: A list of the values from the iterator, in order, up until and excluding
            the first value that does not match the predicate.
        """
        xs: List[IterType] = []
        while self.can_peek() and pred(self._peek):
            xs.append(next(self))
        return xs

    def dropwhile(self, pred: Callable[[IterType], bool]) -> "PeekableIterator[IterType]":
        """Drops elements from the iterator while the predicate is true.

        Updates the iterator to point at the first non-matching element, or exhausts the
        iterator if all elements match the predicate.

        Args:
            pred (Callable[[V], bool]): a function that takes a value from the iterator
                and returns true or false.

        Returns:
            PeekableIterator[V]: a reference to this iterator, so calls can be chained
        """
        while self.can_peek() and pred(self._peek):
            self.__update_peek()
        return self
Functions
can_peek
can_peek() -> bool

Returns true if there is a value that can be peeked at, false otherwise.

Source code in fgpyo/collections/__init__.py
def can_peek(self) -> bool:
    """Returns true if there is a value that can be peeked at, false otherwise."""
    return self._peek is not self._sentinel
dropwhile
dropwhile(pred: Callable[[IterType], bool]) -> PeekableIterator[IterType]

Drops elements from the iterator while the predicate is true.

Updates the iterator to point at the first non-matching element, or exhausts the iterator if all elements match the predicate.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pred | Callable[[IterType], bool] | a function that takes a value from the iterator and returns true or false. | required |

Returns:

| Type | Description |
| --- | --- |
| PeekableIterator[IterType] | a reference to this iterator, so calls can be chained |

Source code in fgpyo/collections/__init__.py
def dropwhile(self, pred: Callable[[IterType], bool]) -> "PeekableIterator[IterType]":
    """Drops elements from the iterator while the predicate is true.

    Updates the iterator to point at the first non-matching element, or exhausts the
    iterator if all elements match the predicate.

    Args:
        pred (Callable[[V], bool]): a function that takes a value from the iterator
            and returns true or false.

    Returns:
        PeekableIterator[V]: a reference to this iterator, so calls can be chained
    """
    while self.can_peek() and pred(self._peek):
        self.__update_peek()
    return self
peek
peek() -> IterType

Returns the next element without consuming it, raising StopIteration if the iterator is exhausted.

Source code in fgpyo/collections/__init__.py
def peek(self) -> IterType:
    """Returns the next element without consuming it, raising StopIteration if exhausted."""
    if self.can_peek():
        return self._peek
    else:
        raise StopIteration
takewhile
takewhile(pred: Callable[[IterType], bool]) -> List[IterType]

Consumes from the iterator while pred is true, and returns the result as a List.

The iterator is left pointing at the first non-matching item, or if all items match then the iterator will be exhausted.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pred | Callable[[IterType], bool] | a function that takes the next value from the iterator and returns true or false. | required |

Returns:

| Type | Description |
| --- | --- |
| List[IterType] | A list of the values from the iterator, in order, up until and excluding the first value that does not match the predicate. |

Source code in fgpyo/collections/__init__.py
def takewhile(self, pred: Callable[[IterType], bool]) -> List[IterType]:
    """Consumes from the iterator while pred is true, and returns the result as a List.

    The iterator is left pointing at the first non-matching item, or if all items match
    then the iterator will be exhausted.

    Args:
        pred: a function that takes the next value from the iterator and returns
              true or false.

    Returns:
        List[V]: A list of the values from the iterator, in order, up until and excluding
        the first value that does not match the predicate.
    """
    xs: List[IterType] = []
    while self.can_peek() and pred(self._peek):
        xs.append(next(self))
    return xs
SupportsLessThanOrEqual

Bases: Protocol

A structural type for objects that support less-than-or-equal comparison.

Source code in fgpyo/collections/__init__.py
class SupportsLessThanOrEqual(Protocol):
    """A structural type for objects that support less-than-or-equal comparison."""

    def __le__(self, other: Any) -> bool: ...

Functions

is_sorted
is_sorted(iterable: Iterable[LessThanOrEqualType]) -> bool

Tests lazily if an iterable of comparable objects is sorted or not.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| iterable | Iterable[LessThanOrEqualType] | An iterable of comparable objects. | required |

Raises:

| Type | Description |
| --- | --- |
| TypeError | If there is more than 1 element in iterable and any of the elements are not comparable. |

Source code in fgpyo/collections/__init__.py
def is_sorted(iterable: Iterable[LessThanOrEqualType]) -> bool:
    """Tests lazily if an iterable of comparable objects is sorted or not.

    Args:
        iterable: An iterable of comparable objects.

    Raises:
        TypeError: If there is more than 1 element in ``iterable`` and any of the elements are not
            comparable.
    """
    return all(map(lambda pair: le(*pair), _pairwise(iterable)))
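
Because the check is lazy and short-circuits at the first out-of-order pair, an unsorted prefix of an unbounded stream can still be detected; a small sketch:

>>> from itertools import chain, count
>>> from fgpyo.collections import is_sorted
>>> is_sorted(chain([5, 1], count()))
False
>>> is_sorted(["a", 1])
Traceback (most recent call last):
    ...
TypeError: '<=' not supported between instances of 'str' and 'int'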

fasta

Modules

builder
Classes for generating fasta files and records for testing

This module contains utility classes for creating fasta files, indexed fasta files (.fai), and sequence dictionaries (.dict).

Examples of creating sets of contigs for writing to fasta

Writing a FASTA with two contigs each with 100 bases:

>>> from pathlib import Path
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> builder.add("chr10").add("AAAAAAAAAA", 10)  
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> builder = builder.add("chr11").add("GGGGGGGGGG", 10)
>>> fasta_path = Path(getfixture("tmp_path")) / "test.fasta"
>>> builder.to_file(path=fasta_path)  

Writing a FASTA with one contig with 100 A's and 50 T's:

>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> builder.add("chr10").add("AAAAAAAAAA", 10).add("TTTTTTTTTT", 5)  
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> builder.to_file(path=fasta_path)  

Add bases to existing contig:

>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> contig_one = builder.add("chr10").add("AAAAAAAAAA", 1)
>>> contig_one.add("NNN", 1)  
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> contig_one.bases
'AAAAAAAAAANNN'
Classes
ContigBuilder

Builder for constructing new contigs, and adding bases to existing contigs. Existing contigs cannot be overwritten; each contig name in FastaBuilder must be unique. Instances of ContigBuilder should be created using FastaBuilder.add(), where assembly and species are optional parameters and will default to FastaBuilder.assembly and FastaBuilder.species.

Attributes:

| Name | Description |
| --- | --- |
| name | Unique contig ID, e.g., "chr10" |
| assembly | Assembly information; if None, the default is 'testassembly' |
| species | Species information; if None, the default is 'testspecies' |
| bases | The bases to be added to the contig, e.g., "A" |

Source code in fgpyo/fasta/builder.py
class ContigBuilder:
    """Builder for constructing new contigs, and adding bases to existing contigs.
    Existing contigs cannot be overwritten; each contig name in FastaBuilder must
    be unique. Instances of ContigBuilder should be created using FastaBuilder.add(),
    where assembly and species are optional parameters and will default to
    FastaBuilder.assembly and FastaBuilder.species.

    Attributes:
        name: Unique contig ID, e.g., "chr10"
        assembly: Assembly information, if None default is 'testassembly'
        species: Species information, if None default is 'testspecies'
        bases: The bases to be added to the contig, e.g., "A"

    """

    def __init__(
        self,
        name: str,
        assembly: str,
        species: str,
    ):
        self.name = name
        self.assembly = assembly
        self.species = species
        self.bases = ""

    def add(self, bases: str, times: int = 1) -> "ContigBuilder":
        """
        Method for adding bases to a new or existing instance of ContigBuilder.

        Args:
            bases: The bases to be added to the contig
            times: The number of times the bases should be repeated

        Example:
        add("AAA", 2) results in the following bases -> "AAAAAA"
        """
        # Remove any spaces in string and enforce upper case format
        bases = bases.replace(" ", "").upper()
        self.bases += str(bases * times)
        return self
Functions
add
add(bases: str, times: int = 1) -> ContigBuilder

Method for adding bases to a new or existing instance of ContigBuilder.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| bases | str | The bases to be added to the contig | required |
| times | int | The number of times the bases should be repeated | 1 |

Example: add("AAA", 2) results in the following bases -> "AAAAAA"

Source code in fgpyo/fasta/builder.py
def add(self, bases: str, times: int = 1) -> "ContigBuilder":
    """
    Method for adding bases to a new or existing instance of ContigBuilder.

    Args:
        bases: The bases to be added to the contig
        times: The number of times the bases should be repeated

    Example:
    add("AAA", 2) results in the following bases -> "AAAAAA"
    """
    # Remove any spaces in string and enforce upper case format
    bases = bases.replace(" ", "").upper()
    self.bases += str(bases * times)
    return self
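
As a quick doctest-style sketch of the repetition behavior described above:

>>> from fgpyo.fasta.builder import FastaBuilder
>>> FastaBuilder().add("chr1").add("AAA", 2).bases
'AAAAAA'
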
FastaBuilder

Builder for constructing sets of one or more contigs.

Provides the ability to manufacture sets of contigs from minimal input, and automatically generates the information necessary for writing the FASTA file, index, and dictionary.

A builder is constructed from an assembly, species, and line length. All attributes have defaults; however, these can be overridden.

Contigs are added to FastaBuilder using: add()

Bases are added to existing contigs using: add()

Once accumulated the contigs can be written to a file using: to_file()

Calling to_file() will also generate the fasta index (.fai) and sequence dictionary (.dict).

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| assembly | str | Assembly information; if None, the default is 'testassembly' |
| species | str | Species; if None, the default is 'testspecies' |
| line_length | int | Desired line length; if None, the default is 80 |
| contig_builders | Dict[str, ContigBuilder] | Private dictionary of contig names and instances of ContigBuilder |

Source code in fgpyo/fasta/builder.py
class FastaBuilder:
    """Builder for constructing sets of one or more contigs.

    Provides the ability to manufacture sets of contigs from minimal input, and automatically
    generates the information necessary for writing the FASTA file, index, and dictionary.

    A builder is constructed from an assembly, species, and line length. All attributes have
    defaults, however these can be overwritten.

    Contigs are added to FastaBuilder using:
    [`add()`][fgpyo.fasta.builder.FastaBuilder.add]

    Bases are added to existing contigs using:
    [`add()`][fgpyo.fasta.builder.ContigBuilder.add]

    Once accumulated the contigs can be written to a file using:
    [`to_file()`][fgpyo.fasta.builder.FastaBuilder.to_file]

    Calling to_file() will also generate the fasta index (.fai) and sequence dictionary (.dict).

    Attributes:
        assembly: Assembly information, if None default is 'testassembly'
        species: Species, if None default is 'testspecies'
        line_length: Desired line length, if None default is 80
        contig_builders: Private dictionary of contig names and instances of ContigBuilder
    """

    def __init__(
        self,
        assembly: str = "testassembly",
        species: str = "testspecies",
        line_length: int = 80,
    ):
        self.assembly: str = assembly
        self.species: str = species
        self.line_length: int = line_length
        self.__contig_builders: Dict[str, ContigBuilder] = {}

    def __getitem__(self, key: str) -> ContigBuilder:
        """Access instance of ContigBuilder by name"""
        return self.__contig_builders[key]

    def add(
        self,
        name: str,
        assembly: Optional[str] = None,
        species: Optional[str] = None,
    ) -> ContigBuilder:
        """
        Creates and returns a new ContigBuilder for a contig with the provided name.
        Contig names must be unique; attempting to create two separate contigs with the same
        name will result in an error.

        Args:
            name: Unique contig ID, e.g., "chr10"
            assembly: Assembly information, if None default is 'testassembly'
            species: Species information, if None default is 'testspecies'
        """
        # Default assembly and species to self.assembly and self.species when not provided
        assembly = assembly if assembly is not None else self.assembly
        species = species if species is not None else self.species

        # Assert that the provided name does not already exist
        assert name not in self.__contig_builders, (
            f"The contig {name} already exists, see docstring for methods on "
            f"adding bases to existing contigs"
        )
        builder: ContigBuilder = ContigBuilder(name=name, assembly=assembly, species=species)
        self.__contig_builders[name] = builder
        return builder

    def to_file(
        self,
        path: Path,
    ) -> None:
        """
        Writes out the set of accumulated contigs to a FASTA file at the `path` given.
        Also generates the accompanying fasta index file (`.fa.fai`) and sequence
        dictionary file (`.dict`).

        Contigs are emitted in the order they were added to the builder.  Sequence
        lines in the FASTA file are wrapped to the line length given when the builder
        was constructed.

        Args:
            path: Path to write files to.

        Example:
        FastaBuilder.to_file(path = pathlib.Path("my_fasta.fa"))
        """

        with path.open("w") as writer:
            for contig in self.__contig_builders.values():
                try:
                    writer.write(f">{contig.name}")
                    writer.write("\n")
                    for line in textwrap.wrap(contig.bases, self.line_length):
                        writer.write(line)
                        writer.write("\n")
                except OSError as error:
                    raise Exception(f"Could not write to {writer}") from error

        # Index fasta
        pysam_faidx(str(path))

        # Write dictionary
        pysam_dict(
            assembly=self.assembly,
            species=self.species,
            output_path=str(f"{path}.dict"),
            input_path=str(path),
        )
Functions
__getitem__
__getitem__(key: str) -> ContigBuilder

Access instance of ContigBuilder by name

Source code in fgpyo/fasta/builder.py
def __getitem__(self, key: str) -> ContigBuilder:
    """Access instance of ContigBuilder by name"""
    return self.__contig_builders[key]
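
A brief sketch of looking up a previously added contig by name:

>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> contig = builder.add("chr1").add("ACGT", 2)
>>> builder["chr1"].bases
'ACGTACGT'
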
add
add(name: str, assembly: Optional[str] = None, species: Optional[str] = None) -> ContigBuilder

Creates and returns a new ContigBuilder for a contig with the provided name. Contig names must be unique; attempting to create two separate contigs with the same name will result in an error.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| name | str | Unique contig ID, e.g., "chr10" | required |
| assembly | Optional[str] | Assembly information; if None, the default is 'testassembly' | None |
| species | Optional[str] | Species information; if None, the default is 'testspecies' | None |
Source code in fgpyo/fasta/builder.py
def add(
    self,
    name: str,
    assembly: Optional[str] = None,
    species: Optional[str] = None,
) -> ContigBuilder:
    """
    Creates and returns a new ContigBuilder for a contig with the provided name.
    Contig names must be unique; attempting to create two separate contigs with the same
    name will result in an error.

    Args:
        name: Unique contig ID, e.g., "chr10"
        assembly: Assembly information, if None default is 'testassembly'
        species: Species information, if None default is 'testspecies'
    """
    # Default assembly and species to self.assembly and self.species when not provided
    assembly = assembly if assembly is not None else self.assembly
    species = species if species is not None else self.species

    # Assert that the provided name does not already exist
    assert name not in self.__contig_builders, (
        f"The contig {name} already exists, see docstring for methods on "
        f"adding bases to existing contigs"
    )
    builder: ContigBuilder = ContigBuilder(name=name, assembly=assembly, species=species)
    self.__contig_builders[name] = builder
    return builder
to_file
to_file(path: Path) -> None

Writes out the set of accumulated contigs to a FASTA file at the path given. Also generates the accompanying fasta index file (.fa.fai) and sequence dictionary file (.dict).

Contigs are emitted in the order they were added to the builder. Sequence lines in the FASTA file are wrapped to the line length given when the builder was constructed.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | Path | Path to write files to. | required |

Example: FastaBuilder.to_file(path = pathlib.Path("my_fasta.fa"))

Source code in fgpyo/fasta/builder.py
def to_file(
    self,
    path: Path,
) -> None:
    """
    Writes out the set of accumulated contigs to a FASTA file at the `path` given.
    Also generates the accompanying fasta index file (`.fa.fai`) and sequence
    dictionary file (`.dict`).

    Contigs are emitted in the order they were added to the builder.  Sequence
    lines in the FASTA file are wrapped to the line length given when the builder
    was constructed.

    Args:
        path: Path to write files to.

    Example:
    FastaBuilder.to_file(path = pathlib.Path("my_fasta.fa"))
    """

    with path.open("w") as writer:
        for contig in self.__contig_builders.values():
            try:
                writer.write(f">{contig.name}")
                writer.write("\n")
                for line in textwrap.wrap(contig.bases, self.line_length):
                    writer.write(line)
                    writer.write("\n")
            except OSError as error:
                raise Exception(f"Could not write to {writer}") from error

    # Index fasta
    pysam_faidx(str(path))

    # Write dictionary
    pysam_dict(
        assembly=self.assembly,
        species=self.species,
        output_path=str(f"{path}.dict"),
        input_path=str(path),
    )
Functions
pysam_dict
pysam_dict(assembly: str, species: str, output_path: str, input_path: str) -> None

Calls pysam.dict and writes the sequence dictionary to the provided output path

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| assembly | str | Assembly | required |
| species | str | Species | required |
| output_path | str | File path to write dictionary to | required |
| input_path | str | Path to fasta file | required |

Source code in fgpyo/fasta/builder.py
def pysam_dict(assembly: str, species: str, output_path: str, input_path: str) -> None:
    """Calls pysam.dict and writes the sequence dictionary to the provided output path

    Args:
        assembly: Assembly
        species: Species
        output_path: File path to write dictionary to
        input_path: Path to fasta file
    """
    samtools_dict("-a", assembly, "-s", species, "-o", output_path, input_path)
pysam_faidx
pysam_faidx(input_path: str) -> None

Calls pysam.faidx and writes fasta index in the same file location as the fasta file

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input_path | str | Path to fasta file | required |

Source code in fgpyo/fasta/builder.py
def pysam_faidx(input_path: str) -> None:
    """Calls pysam.faidx and writes fasta index in the same file location as the fasta file

    Args:
        input_path: Path to fasta file
    """
    samtools_faidx(input_path)
sequence_dictionary
Classes for representing sequence dictionaries.
Examples of building and using sequence dictionaries

Building a sequence dictionary from a pysam.AlignmentHeader:

>>> import pysam
>>> from fgpyo.fasta.sequence_dictionary import SequenceDictionary
>>> sd: SequenceDictionary
>>> with pysam.AlignmentFile("./tests/fgpyo/sam/data/valid.sam") as fh:
...     sd = SequenceDictionary.from_sam(fh.header)
>>> print(sd)  
@SQ     SN:chr1 LN:101
@SQ     SN:chr2 LN:101
@SQ     SN:chr3 LN:101
@SQ     SN:chr4 LN:101
@SQ     SN:chr5 LN:101
@SQ     SN:chr6 LN:101
@SQ     SN:chr7 LN:404
@SQ     SN:chr8 LN:202

Query based on index:

>>> print(sd[3])  
@SQ     SN:chr4 LN:101

Query based on name:

>>> print(sd["chr6"])  
@SQ     SN:chr6 LN:101

Add, get, and delete attributes:

>>> from fgpyo.fasta.sequence_dictionary import Keys
>>> meta = sd[0]
>>> print(meta)  
@SQ     SN:chr1 LN:101
>>> meta[Keys.ASSEMBLY] = "hg38"
>>> print(meta)  
@SQ     SN:chr1 LN:101  AS:hg38
>>> meta.get(Keys.ASSEMBLY)
'hg38'
>>> meta.get(Keys.SPECIES) is None
True
>>> Keys.MD5 in meta
False
>>> del meta[Keys.ASSEMBLY]
>>> print(meta)  
@SQ     SN:chr1 LN:101

Get a sequence based on one of its aliases

>>> meta[Keys.ALIASES] = "foo,bar,car"
>>> sd = SequenceDictionary(infos=[meta] + sd.infos[1:])
>>> print(sd)  
@SQ     SN:chr1 LN:101  AN:foo,bar,car
@SQ     SN:chr2 LN:101
@SQ     SN:chr3 LN:101
@SQ     SN:chr4 LN:101
@SQ     SN:chr5 LN:101
@SQ     SN:chr6 LN:101
@SQ     SN:chr7 LN:404
@SQ     SN:chr8 LN:202
>>> print(sd["chr1"])  
@SQ     SN:chr1 LN:101  AN:foo,bar,car
>>> print(sd["bar"])  
@SQ     SN:chr1 LN:101  AN:foo,bar,car

Create a pysam.AlignmentHeader from a sequence dictionary:

>>> sd.to_sam_header()  
<pysam.libcalignmentfile.AlignmentHeader object at ...>
>>> print(sd.to_sam_header())  
@HD     VN:1.5
@SQ     SN:chr1 LN:101  AN:foo,bar,car
@SQ     SN:chr2 LN:101
@SQ     SN:chr3 LN:101
@SQ     SN:chr4 LN:101
@SQ     SN:chr5 LN:101
@SQ     SN:chr6 LN:101
@SQ     SN:chr7 LN:404
@SQ     SN:chr8 LN:202

Create a pysam.AlignmentHeader from a sequence dictionary with extra header items:

>>> sd.to_sam_header(
...     extra_header={"RG": [{"ID": "A", "LB": "a-library"}, {"ID": "B", "LB": "b-library"}]}
... )  
<pysam.libcalignmentfile.AlignmentHeader object at ...>
>>> print(sd.to_sam_header(
...     extra_header={"RG": [{"ID": "A", "LB": "a-library"}, {"ID": "B", "LB": "b-library"}]}
... ))  
@HD     VN:1.5
@SQ     SN:chr1 LN:101  AN:foo,bar,car
@SQ     SN:chr2 LN:101
@SQ     SN:chr3 LN:101
@SQ     SN:chr4 LN:101
@SQ     SN:chr5 LN:101
@SQ     SN:chr6 LN:101
@SQ     SN:chr7 LN:404
@SQ     SN:chr8 LN:202
@RG     ID:A    LB:a-library
@RG     ID:B    LB:b-library
Attributes
SEQUENCE_NAME_PATTERN module-attribute
SEQUENCE_NAME_PATTERN: Pattern = compile('^[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*$')

Regular expression for valid reference sequence names according to the SAM spec

Classes
AlternateLocus dataclass

Stores an alternate locus for an associated sequence (1-based inclusive)

Source code in fgpyo/fasta/sequence_dictionary.py
@dataclass(frozen=True, init=True)
class AlternateLocus:
    """Stores an alternate locus for an associated sequence (1-based inclusive)"""

    name: str
    start: int
    end: int

    def __post_init__(self) -> None:
        """Any post initialization validation should go here"""
        if self.start > self.end:
            raise ValueError(f"start > end: {self.start} > {self.end}")
        if self.start < 1:
            raise ValueError(f"start < 1: {self.start}")

    def __str__(self) -> str:
        return f"{self.name}:{self.start}-{self.end}"

    def __len__(self) -> int:
        return self.end - self.start + 1

    @staticmethod
    def parse(value: str) -> "AlternateLocus":
        """Parse the genomic interval of format: `<contig>:<start>-<end>`"""
        name, rest = value.split(":", maxsplit=1)
        start, end = rest.split("-", maxsplit=1)
        return AlternateLocus(name=name, start=int(start), end=int(end))
Functions
__post_init__
__post_init__() -> None

Any post initialization validation should go here

Source code in fgpyo/fasta/sequence_dictionary.py
def __post_init__(self) -> None:
    """Any post initialization validation should go here"""
    if self.start > self.end:
        raise ValueError(f"start > end: {self.start} > {self.end}")
    if self.start < 1:
        raise ValueError(f"start < 1: {self.start}")
parse staticmethod
parse(value: str) -> AlternateLocus

Parse the genomic interval of format: <contig>:<start>-<end>

Source code in fgpyo/fasta/sequence_dictionary.py
@staticmethod
def parse(value: str) -> "AlternateLocus":
    """Parse the genomic interval of format: `<contig>:<start>-<end>`"""
    name, rest = value.split(":", maxsplit=1)
    start, end = rest.split("-", maxsplit=1)
    return AlternateLocus(name=name, start=int(start), end=int(end))
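
A small sketch of parsing and inspecting a locus with this method:

>>> from fgpyo.fasta.sequence_dictionary import AlternateLocus
>>> locus = AlternateLocus.parse("chr1:1-100")
>>> (locus.name, locus.start, locus.end)
('chr1', 1, 100)
>>> len(locus)
100
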
Keys

Bases: StrEnum

Enumeration of tags/attributes available on a sequence record/metadata (SAM @SQ line).

Source code in fgpyo/fasta/sequence_dictionary.py
@unique
class Keys(StrEnum):
    """Enumeration of tags/attributes available on a sequence record/metadata (SAM @SQ line)."""

    ALIASES = "AN"
    ALTERNATE_LOCUS = "AH"
    ASSEMBLY = "AS"
    DESCRIPTION = "DS"
    SEQUENCE_LENGTH = "LN"
    MD5 = "M5"
    SEQUENCE_NAME = "SN"
    SPECIES = "SP"
    TOPOLOGY = "TP"
    URI = "UR"

    @staticmethod
    def attributes() -> List[str]:
        """The list of keys that are allowed to be attributes in `SequenceMetadata`.  Notably
        `SEQUENCE_LENGTH` and `SEQUENCE_NAME` are not allowed."""
        return [key for key in Keys if key != Keys.SEQUENCE_NAME and key != Keys.SEQUENCE_LENGTH]
Functions
attributes staticmethod
attributes() -> List[str]

The list of keys that are allowed to be attributes in SequenceMetadata. Notably SEQUENCE_LENGTH and SEQUENCE_NAME are not allowed.

Source code in fgpyo/fasta/sequence_dictionary.py
@staticmethod
def attributes() -> List[str]:
    """The list of keys that are allowed to be attributes in `SequenceMetadata`.  Notably
    `SEQUENCE_LENGTH` and `SEQUENCE_NAME` are not allowed."""
    return [key for key in Keys if key != Keys.SEQUENCE_NAME and key != Keys.SEQUENCE_LENGTH]
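
For instance, the exclusion of the name and length keys can be checked directly (a small sketch):

>>> from fgpyo.fasta.sequence_dictionary import Keys
>>> Keys.SEQUENCE_NAME in Keys.attributes()
False
>>> Keys.ASSEMBLY in Keys.attributes()
True
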
SequenceDictionary dataclass

Bases: Mapping[Union[str, int], SequenceMetadata]

Contains an ordered collection of sequences.

A specific SequenceMetadata may be retrieved by name (str) or index (int), either by using the generic get method or by the correspondingly named by_name and by_index methods. The latter methods provide faster retrieval when the type is known.

This mapping collection iterates over the keys. To iterate over each SequenceMetadata, either use the typical values() method or access the metadata directly with infos.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| infos | List[SequenceMetadata] | the ordered collection of sequence metadata |

Source code in fgpyo/fasta/sequence_dictionary.py
@dataclass(frozen=True, init=True)
class SequenceDictionary(Mapping[Union[str, int], SequenceMetadata]):
    """Contains an ordered collection of sequences.

    A specific `SequenceMetadata` may be retrieved by name (`str`) or index (`int`), either by
    using the generic `get` method or by the correspondingly named `by_name` and `by_index` methods.
    The latter methods provide faster retrieval when the type is known.

    This _mapping_ collection iterates over the _keys_.  To iterate over each `SequenceMetadata`,
    either use the typical `values()` method or access the metadata directly with `infos`.

    Attributes:
        infos: the ordered collection of sequence metadata
    """

    infos: List[SequenceMetadata]
    _dict: Dict[str, SequenceMetadata] = field(init=False, repr=False)

    def __post_init__(self) -> None:
        # Initialize a mapping from sequence name to the sequence metadata for all names
        self_dict: Dict[str, SequenceMetadata] = {}
        for index, info in enumerate(self.infos):
            if info.index != index:
                raise ValueError(
                    "Infos must be given with index set correctly."
                    + f"  See the {index}th with name: {info.name}"
                )
            for name in info.all_names:
                if name in self_dict:
                    raise ValueError(f"Found duplicate sequence name: {name}")
                self_dict[name] = info
        object.__setattr__(self, "_dict", self_dict)

    def same_as(self, other: "SequenceDictionary") -> bool:
        """Returns true if the sequences share a common reference name (including aliases), have
        the same length, and the same MD5 if both have MD5s"""
        if len(self) != len(other):
            return False
        return all(this.same_as(that) for this, that in zip(self.infos, other.infos))

    def to_sam(self) -> List[Dict[str, Any]]:
        """Converts to a list of dictionaries, one per sequence."""
        return [meta.to_sam() for meta in self.infos]

    def to_sam_header(
        self,
        extra_header: Optional[Dict[str, Any]] = None,
    ) -> pysam.AlignmentHeader:
        """Converts the sequence dictionary to a `pysam.AlignmentHeader`.

        Args:
            extra_header: a dictionary of extra values to add to the header, None otherwise.  See
                          `pysam.AlignmentHeader` for more details.
        """
        header_dict: Dict[str, Any] = {
            "HD": {"VN": "1.5"},
            "SQ": self.to_sam(),
        }
        if extra_header is not None:
            header_dict = {**header_dict, **extra_header}
        return pysam.AlignmentHeader.from_dict(header_dict=header_dict)

    @staticmethod
    @overload
    def from_sam(data: Path) -> "SequenceDictionary": ...

    @staticmethod
    @overload
    def from_sam(data: pysam.AlignmentFile) -> "SequenceDictionary": ...

    @staticmethod
    @overload
    def from_sam(data: pysam.AlignmentHeader) -> "SequenceDictionary": ...

    @staticmethod
    @overload
    def from_sam(data: List[Dict[str, Any]]) -> "SequenceDictionary": ...

    @staticmethod
    def from_sam(
        data: Union[Path, pysam.AlignmentFile, pysam.AlignmentHeader, List[Dict[str, Any]]],
    ) -> "SequenceDictionary":
        """Creates a `SequenceDictionary` from a SAM file or its header.

        Args:
            data: The input may be any of:
                - a path to a SAM file
                - an open `pysam.AlignmentFile`
                - the `pysam.AlignmentHeader` associated with a `pysam.AlignmentFile`
                - the contents of a header's `SQ` fields, as returned by `AlignmentHeader.to_dict()`
        Returns:
            A `SequenceDictionary` mapping reference names to their metadata.
        """
        seq_dict: SequenceDictionary
        if isinstance(data, pysam.AlignmentHeader):
            seq_dict = SequenceDictionary.from_sam(data.to_dict()["SQ"])
        elif isinstance(data, pysam.AlignmentFile):
            seq_dict = SequenceDictionary.from_sam(data.header.to_dict()["SQ"])
        elif isinstance(data, Path):
            with sam.reader(data) as fh:
                seq_dict = SequenceDictionary.from_sam(fh.header)
        else:  # assuming `data` is a `list[dict[str, Any]]`
            try:
                infos: List[SequenceMetadata] = [
                    SequenceMetadata.from_sam(meta=meta, index=index)
                    for index, meta in enumerate(data)
                ]
                seq_dict = SequenceDictionary(infos=infos)
            except Exception as e:
                raise ValueError(f"Could not parse sequence information from data: {data}") from e

        return seq_dict

    def __getitem__(self, key: Union[str, int]) -> SequenceMetadata:
        return self._dict[key] if isinstance(key, str) else self.infos[key]

    def get_by_name(self, name: str) -> Optional[SequenceMetadata]:
        """Gets a `SequenceMetadata` explicitly by `name`.  Returns None if
        the name does not exist in this dictionary"""
        return self._dict.get(name)

    def by_name(self, name: str) -> SequenceMetadata:
        """Gets a `SequenceMetadata` explicitly by `name`.  The name must exist."""
        return self._dict[name]

    def by_index(self, index: int) -> SequenceMetadata:
        """Gets a `SequenceMetadata` explicitly by `index`.  Raises an `IndexError`
        if the index is out of bounds."""
        return self.infos[index]

    def __iter__(self) -> Iterator[str]:
        return iter(self._dict)

    def __len__(self) -> int:
        return len(self.infos)

    def __str__(self) -> str:
        return "\n".join(f"{info}" for info in self.infos)
Functions
by_index
by_index(index: int) -> SequenceMetadata

Gets a SequenceMetadata explicitly by index. Raises an IndexError if the index is out of bounds.

Source code in fgpyo/fasta/sequence_dictionary.py
def by_index(self, index: int) -> SequenceMetadata:
    """Gets a `SequenceMetadata` explicitly by `index`.  Raises an `IndexError`
    if the index is out of bounds."""
    return self.infos[index]
by_name
by_name(name: str) -> SequenceMetadata

Gets a SequenceMetadata explicitly by name. The name must exist.

Source code in fgpyo/fasta/sequence_dictionary.py
def by_name(self, name: str) -> SequenceMetadata:
    """Gets a `SequenceMetadata` explicitly by `name`.  The name must exist."""
    return self._dict[name]
from_sam staticmethod
from_sam(data: Path) -> SequenceDictionary
from_sam(data: AlignmentFile) -> SequenceDictionary
from_sam(data: AlignmentHeader) -> SequenceDictionary
from_sam(data: List[Dict[str, Any]]) -> SequenceDictionary
from_sam(data: Union[Path, AlignmentFile, AlignmentHeader, List[Dict[str, Any]]]) -> SequenceDictionary

Creates a SequenceDictionary from a SAM file or its header.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | Union[Path, AlignmentFile, AlignmentHeader, List[Dict[str, Any]]] | The input may be any of: a path to a SAM file; an open pysam.AlignmentFile; the pysam.AlignmentHeader associated with a pysam.AlignmentFile; or the contents of a header's SQ fields, as returned by AlignmentHeader.to_dict() | required |

Returns: A SequenceDictionary mapping reference names to their metadata.

Source code in fgpyo/fasta/sequence_dictionary.py
@staticmethod
def from_sam(
    data: Union[Path, pysam.AlignmentFile, pysam.AlignmentHeader, List[Dict[str, Any]]],
) -> "SequenceDictionary":
    """Creates a `SequenceDictionary` from a SAM file or its header.

    Args:
        data: The input may be any of:
            - a path to a SAM file
            - an open `pysam.AlignmentFile`
            - the `pysam.AlignmentHeader` associated with a `pysam.AlignmentFile`
            - the contents of a header's `SQ` fields, as returned by `AlignmentHeader.to_dict()`
    Returns:
        A `SequenceDictionary` mapping reference names to their metadata.
    """
    seq_dict: SequenceDictionary
    if isinstance(data, pysam.AlignmentHeader):
        seq_dict = SequenceDictionary.from_sam(data.to_dict()["SQ"])
    elif isinstance(data, pysam.AlignmentFile):
        seq_dict = SequenceDictionary.from_sam(data.header.to_dict()["SQ"])
    elif isinstance(data, Path):
        with sam.reader(data) as fh:
            seq_dict = SequenceDictionary.from_sam(fh.header)
    else:  # assuming `data` is a `list[dict[str, Any]]`
        try:
            infos: List[SequenceMetadata] = [
                SequenceMetadata.from_sam(meta=meta, index=index)
                for index, meta in enumerate(data)
            ]
            seq_dict = SequenceDictionary(infos=infos)
        except Exception as e:
            raise ValueError(f"Could not parse sequence information from data: {data}") from e

    return seq_dict
get_by_name
get_by_name(name: str) -> Optional[SequenceMetadata]

Gets a SequenceMetadata explicitly by name. Returns None if the name does not exist in this dictionary

Source code in fgpyo/fasta/sequence_dictionary.py
def get_by_name(self, name: str) -> Optional[SequenceMetadata]:
    """Gets a `SequenceMetadata` explicitly by `name`.  Returns None if
    the name does not exist in this dictionary"""
    return self._dict.get(name)
same_as
same_as(other: SequenceDictionary) -> bool

Returns true if the sequences share a common reference name (including aliases), have the same length, and the same MD5 if both have MD5s

Source code in fgpyo/fasta/sequence_dictionary.py
def same_as(self, other: "SequenceDictionary") -> bool:
    """Returns true if the sequences share a common reference name (including aliases), have
    the same length, and the same MD5 if both have MD5s"""
    if len(self) != len(other):
        return False
    return all(this.same_as(that) for this, that in zip(self.infos, other.infos))
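
A minimal sketch of the comparison, constructing dictionaries directly from SequenceMetadata:

>>> from fgpyo.fasta.sequence_dictionary import SequenceDictionary, SequenceMetadata
>>> sd1 = SequenceDictionary(infos=[SequenceMetadata(name="chr1", length=100, index=0)])
>>> sd2 = SequenceDictionary(infos=[SequenceMetadata(name="chr1", length=100, index=0)])
>>> sd1.same_as(sd2)
True
>>> sd1.same_as(SequenceDictionary(infos=[SequenceMetadata(name="chr1", length=50, index=0)]))
False
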
to_sam
to_sam() -> List[Dict[str, Any]]

Converts to a list of dictionaries, one per sequence.

Source code in fgpyo/fasta/sequence_dictionary.py
def to_sam(self) -> List[Dict[str, Any]]:
    """Converts to a list of dictionaries, one per sequence."""
    return [meta.to_sam() for meta in self.infos]
to_sam_header
to_sam_header(extra_header: Optional[Dict[str, Any]] = None) -> AlignmentHeader

Converts the sequence dictionary to a pysam.AlignmentHeader.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| extra_header | Optional[Dict[str, Any]] | a dictionary of extra values to add to the header, None otherwise. See pysam.AlignmentHeader for more details. | None |
Source code in fgpyo/fasta/sequence_dictionary.py
def to_sam_header(
    self,
    extra_header: Optional[Dict[str, Any]] = None,
) -> pysam.AlignmentHeader:
    """Converts the sequence dictionary to a `pysam.AlignmentHeader`.

    Args:
        extra_header: a dictionary of extra values to add to the header, None otherwise.  See
                      `pysam.AlignmentHeader` for more details.
    """
    header_dict: Dict[str, Any] = {
        "HD": {"VN": "1.5"},
        "SQ": self.to_sam(),
    }
    if extra_header is not None:
        header_dict = {**header_dict, **extra_header}
    return pysam.AlignmentHeader.from_dict(header_dict=header_dict)
SequenceMetadata dataclass

Bases: MutableMapping[Union[Keys, str], str]

Stores information about a single Sequence (ex. chromosome, contig).

Implements the mutable mapping interface, which provides access to the attributes of this sequence, including name and length, but not index. When using the mapping interface, for example getting, setting, deleting, as well as iterating over keys, values, and items, the values will always be strings (str type). For example, the length will be a str when accessing via get; access the length directly or use len to return an int. Similarly, use the aliases property to return a List[str] of aliases, the alternate property to return an AlternateLocus-typed instance, and the topology property to return a Topology-typed instance.

All attributes except name and length may be set; to change those, use dataclasses.replace to create a new copy.

Important: The len method returns the length of the sequence, not the length of the attributes. Use len(meta.attributes) for the latter.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| name | str | the primary name of the sequence |
| length | int | the length of the sequence, or zero if unknown |
| index | int | the index in the sequence dictionary |
| attributes | Dict[Union[Keys, str], str] | attributes of this sequence |

Source code in fgpyo/fasta/sequence_dictionary.py
@dataclass(frozen=True, init=True)
class SequenceMetadata(MutableMapping[Union[Keys, str], str]):
    """Stores information about a single Sequence (ex. chromosome, contig).

    Implements the mutable mapping interface, which provides access to the attributes of this
    sequence, including name, length, but not index.  When using the mapping interface, for example
    getting, setting, deleting, as well as iterating over keys, values, and items, the _values_ will
    always be strings (`str` type).  For example, the length will be a `str` when accessing via
    `get`; access the length directly or use `len` to return an `int`.  Similarly, use the
    `aliases` property to return a `List[str]` of aliases, use the `alternate` property to return
    an `AlternateLocus`-typed instance, and the `topology` property to return a `Topology`-typed
    instance.

    All attributes except name and length may be set; to change those, use `dataclasses.replace`
    to create a new copy.

    Important: The `len` method returns the length of the sequence, not the length of the
    attributes.  Use `len(meta.attributes)` for the latter.

    Attributes:
      name: the primary name of the sequence
      length: the length of the sequence, or zero if unknown
      index: the index in the sequence dictionary
      attributes: attributes of this sequence
    """

    name: str
    length: int
    index: int
    attributes: Dict[Union[Keys, str], str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        """Any post initialization validation should go here"""
        if self.length < 0:
            raise ValueError(f"Length must be >= 0 for '{self.name}'")
        if re.search(SEQUENCE_NAME_PATTERN, self.name) is None:
            raise ValueError(f"Illegal name: '{self.name}'")
        if Keys.SEQUENCE_NAME in self.attributes:
            raise ValueError(f"'{Keys.SEQUENCE_NAME}' should not be given in the list of attributes")
        if Keys.SEQUENCE_LENGTH in self.attributes:
            raise ValueError(f"'{Keys.SEQUENCE_LENGTH}' should not be given in the list of attributes")

    @property
    def aliases(self) -> List[str]:
        """The aliases (not including the primary name)"""
        aliases = self.attributes.get(Keys.ALIASES)
        return [] if aliases is None else aliases.split(",")

    @property
    def all_names(self) -> List[str]:
        """A list of all names, including the primary name and aliases, in that order."""
        return [self.name] + self.aliases

    @property
    def alternate(self) -> Optional[AlternateLocus]:
        """Gets the alternate locus for this sequence"""
        if Keys.ALTERNATE_LOCUS not in self.attributes:
            return None
        value = self.attributes[Keys.ALTERNATE_LOCUS]
        if value == "*":
            return None
        locus = AlternateLocus.parse(value)
        if locus.name == "=":
            locus = replace(locus, name=self.name)
        return locus

    @property
    def is_alternate(self) -> bool:
        """True if there is an alternate locus defined, False otherwise"""
        return self.alternate is not None

    @property
    def md5(self) -> Optional[str]:
        return self.get(Keys.MD5)

    @property
    def assembly(self) -> Optional[str]:
        return self.get(Keys.ASSEMBLY)

    @property
    def uri(self) -> Optional[str]:
        return self.get(Keys.URI)

    @property
    def species(self) -> Optional[str]:
        return self.get(Keys.SPECIES)

    @property
    def description(self) -> Optional[str]:
        return self.get(Keys.DESCRIPTION)

    @property
    def topology(self) -> Optional[Topology]:
        value = self.get(Keys.TOPOLOGY)
        return None if value is None else Topology[value]

    def same_as(self, other: "SequenceMetadata") -> bool:
        """Returns true if the sequences share a common reference name (including aliases), have
        the same length, and the same MD5 if both have MD5s."""
        if self.length != other.length:
            return False
        elif self.name != other.name and other.name not in self.all_names:
            return False
        self_m5 = self.md5
        other_m5 = other.md5
        if self_m5 is None or other_m5 is None:
            return True
        else:
            return self_m5 == other_m5

    def to_sam(self) -> Dict[str, Any]:
        """Converts the sequence metadata to a dictionary equivalent to one item in the
        list of sequences from `pysam.AlignmentHeader#to_dict()["SQ"]`."""
        meta_dict: Dict[str, Any] = {
            f"{Keys.SEQUENCE_NAME}": self.name,
            f"{Keys.SEQUENCE_LENGTH}": self.length,
        }
        if len(self.attributes) > 0:
            meta_dict = {**meta_dict, **self.attributes}

        return meta_dict

    @staticmethod
    def from_sam(meta: Dict[Union[Keys, str], Any], index: int) -> "SequenceMetadata":
        """Builds a `SequenceMetadata` from a dictionary.  The keys must include the sequence
        name (`Keys.SEQUENCE_NAME`) and length (`Keys.SEQUENCE_LENGTH`).  All other keys from
        `Keys` will be stored in the resulting attributes.

        Args:
            meta: the python dictionary with keys from `Keys`.  This is typically the dictionary
                  stored in the `"SQ"` level of the two-level dictionary returned by the
                  `pysam.AlignmentHeader#to_dict()` method.
            index: the 0-based index to use for this sequence
        """
        name = meta[Keys.SEQUENCE_NAME]
        length = meta[Keys.SEQUENCE_LENGTH]
        attributes = copy.deepcopy(meta)
        del attributes[Keys.SEQUENCE_NAME]
        del attributes[Keys.SEQUENCE_LENGTH]
        return SequenceMetadata(name=name, length=length, index=index, attributes=attributes)

    def __getitem__(self, key: Union[Keys, str]) -> Any:
        if key == Keys.SEQUENCE_NAME.value:
            return self.name
        elif key == Keys.SEQUENCE_LENGTH.value:
            return f"{self.length}"
        return self.attributes[key]

    def __setitem__(self, key: Union[Keys, str], value: str) -> None:
        if key == Keys.SEQUENCE_NAME or key == Keys.SEQUENCE_LENGTH:
            raise KeyError(f"Cannot set '{key}' on SequenceMetadata with name '{self.name}'")
        self.attributes[key] = value

    def __delitem__(self, key: Union[Keys, str]) -> None:
        if key == Keys.SEQUENCE_NAME or key == Keys.SEQUENCE_LENGTH:
            raise KeyError(f"Cannot delete '{key}' on SequenceMetadata with name '{self.name}'")
        del self.attributes[key]

    def __iter__(self) -> Iterator[Union[Keys, str]]:
        pre_iter = iter((Keys.SEQUENCE_NAME, Keys.SEQUENCE_LENGTH))
        return itertools.chain(pre_iter, iter(self.attributes))

    def __len__(self) -> int:
        return self.length

    def __str__(self) -> str:
        return "@SQ\t" + "\t".join(f"{key}:{value}" for key, value in self.to_sam().items())

    def __index__(self) -> int:
        return self.index
Attributes
aliases property
aliases: List[str]

The aliases (not including the primary name)

all_names property
all_names: List[str]

A list of all names, including the primary name and aliases, in that order.

alternate property
alternate: Optional[AlternateLocus]

Gets the alternate locus for this sequence

is_alternate property
is_alternate: bool

True if there is an alternate locus defined, False otherwise

Functions
__post_init__
__post_init__() -> None

Any post initialization validation should go here

Source code in fgpyo/fasta/sequence_dictionary.py
def __post_init__(self) -> None:
    """Any post initialization validation should go here"""
    if self.length < 0:
        raise ValueError(f"Length must be >= 0 for '{self.name}'")
    if re.search(SEQUENCE_NAME_PATTERN, self.name) is None:
        raise ValueError(f"Illegal name: '{self.name}'")
    if Keys.SEQUENCE_NAME in self.attributes:
        raise ValueError(f"'{Keys.SEQUENCE_NAME}' should not be given in the list of attributes")
    if Keys.SEQUENCE_LENGTH in self.attributes:
        raise ValueError(f"'{Keys.SEQUENCE_LENGTH}' should not be given in the list of attributes")
from_sam staticmethod
from_sam(meta: Dict[Union[Keys, str], Any], index: int) -> SequenceMetadata

Builds a SequenceMetadata from a dictionary. The keys must include the sequence name (Keys.SEQUENCE_NAME) and length (Keys.SEQUENCE_LENGTH). All other keys from Keys will be stored in the resulting attributes.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| meta | Dict[Union[Keys, str], Any] | the python dictionary with keys from Keys. This is typically the dictionary stored in the "SQ" level of the two-level dictionary returned by the pysam.AlignmentHeader#to_dict() method. | required |
| index | int | the 0-based index to use for this sequence | required |
Source code in fgpyo/fasta/sequence_dictionary.py
@staticmethod
def from_sam(meta: Dict[Union[Keys, str], Any], index: int) -> "SequenceMetadata":
    """Builds a `SequenceMetadata` from a dictionary.  The keys must include the sequence
    name (`Keys.SEQUENCE_NAME`) and length (`Keys.SEQUENCE_LENGTH`).  All other keys from
    `Keys` will be stored in the resulting attributes.

    Args:
        meta: the python dictionary with keys from `Keys`.  This is typically the dictionary
              stored in the `"SQ"` level of the two-level dictionary returned by the
              `pysam.AlignmentHeader#to_dict()` method.
        index: the 0-based index to use for this sequence
    """
    name = meta[Keys.SEQUENCE_NAME]
    length = meta[Keys.SEQUENCE_LENGTH]
    attributes = copy.deepcopy(meta)
    del attributes[Keys.SEQUENCE_NAME]
    del attributes[Keys.SEQUENCE_LENGTH]
    return SequenceMetadata(name=name, length=length, index=index, attributes=attributes)
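
A brief sketch of building metadata from a plain dictionary, as described above:

>>> from fgpyo.fasta.sequence_dictionary import SequenceMetadata
>>> meta = SequenceMetadata.from_sam({"SN": "chr1", "LN": 100, "AS": "hg38"}, index=0)
>>> (meta.name, meta.length, meta.assembly)
('chr1', 100, 'hg38')
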
same_as
same_as(other: SequenceMetadata) -> bool

Returns true if the sequences share a common reference name (including aliases), have the same length, and the same MD5 if both have MD5s.

Source code in fgpyo/fasta/sequence_dictionary.py
def same_as(self, other: "SequenceMetadata") -> bool:
    """Returns true if the sequences share a common reference name (including aliases), have
    the same length, and the same MD5 if both have MD5s."""
    if self.length != other.length:
        return False
    elif self.name != other.name and other.name not in self.all_names:
        return False
    self_m5 = self.md5
    other_m5 = other.md5
    if self_m5 is None or other_m5 is None:
        return True
    else:
        return self_m5 == other_m5
to_sam
to_sam() -> Dict[str, Any]

Converts the sequence metadata to a dictionary equivalent to one item in the list of sequences from pysam.AlignmentHeader#to_dict()["SQ"].

Source code in fgpyo/fasta/sequence_dictionary.py
def to_sam(self) -> Dict[str, Any]:
    """Converts the sequence metadata to a dictionary equivalent to one item in the
    list of sequences from `pysam.AlignmentHeader#to_dict()["SQ"]`."""
    meta_dict: Dict[str, Any] = {
        f"{Keys.SEQUENCE_NAME}": self.name,
        f"{Keys.SEQUENCE_LENGTH}": self.length,
    }
    if len(self.attributes) > 0:
        meta_dict = {**meta_dict, **self.attributes}

    return meta_dict
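
A small sketch of the resulting shape (attribute keys may be plain strings or Keys members; the enum keys render as their two-letter tags):

>>> from fgpyo.fasta.sequence_dictionary import SequenceMetadata
>>> meta = SequenceMetadata(name="chr1", length=100, index=0, attributes={"AS": "hg38"})
>>> meta.to_sam()
{'SN': 'chr1', 'LN': 100, 'AS': 'hg38'}
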
Topology

Bases: StrEnum

Enumeration for the topology of reference sequences (SAM @SQ.TP)

Source code in fgpyo/fasta/sequence_dictionary.py
@unique
class Topology(StrEnum):
    """Enumeration for the topology of reference sequences (SAM @SQ.TP)"""

    LINEAR = "LINEAR"
    CIRCULAR = "CIRCULAR"
Modules

fastx

Zipping FASTX Files

Zipping a set of FASTA/FASTQ files into a single stream of data is a common task in bioinformatics and can be achieved with the FastxZipped() context manager. The context manager facilitates opening all input FASTA/FASTQ files and closing them after iteration is complete. For every iteration of FastxZipped(), a tuple of the next FASTX records is returned (of type pysam.FastxRecord()). An exception will be raised if any of the input files are malformed or truncated, or if the record names are not equivalent and in sync.

Importantly, this context manager is optimized for fast streaming read-only usage and, by default, any previous records saved while advancing the iterator will not be correct, as the underlying pointer in memory will refer to the most recent record only, and not any past records. To preserve the state of all previously iterated records, set the parameter persist to True.

>>> from fgpyo.fastx import FastxZipped
>>> with FastxZipped("r1.fq", "r2.fq", persist=False) as zipped:  
...    for (r1, r2) in zipped:
...         print(f"{r1.name}: {r1.sequence}, {r2.name}: {r2.sequence}")
seq1: AAAA, seq1: CCCC
seq2: GGGG, seq2: TTTT

Classes

FastxZipped

Bases: AbstractContextManager, Iterator[Tuple[FastxRecord, ...]]

A context manager that will lazily zip over any number of FASTA/FASTQ files.

Parameters:

Name Type Description Default
paths Union[Path, str]

Paths to the FASTX files to zip over.

()
persist bool

Whether to persist the state of previous records during iteration.

False
Source code in fgpyo/fastx/__init__.py
class FastxZipped(AbstractContextManager, Iterator[Tuple[FastxRecord, ...]]):
    """A context manager that will lazily zip over any number of FASTA/FASTQ files.

    Args:
        paths: Paths to the FASTX files to zip over.
        persist: Whether to persist the state of previous records during iteration.

    """

    def __init__(self, *paths: Union[Path, str], persist: bool = False) -> None:
        """Instantiate a `FastxZipped` context manager and iterator."""
        if len(paths) <= 0:
            raise ValueError(f"Must provide at least one FASTX to {self.__class__.__name__}")
        self._persist: bool = persist
        self._paths: Tuple[Union[Path, str], ...] = paths
        self._fastx = tuple(FastxFile(str(path), persist=self._persist) for path in self._paths)

    @staticmethod
    def _name_minus_ordinal(name: str) -> str:
        """Return the name of the FASTX record minus its ordinal suffix (e.g. "/1" or "/2")."""
        return name[: len(name) - 2] if len(name) >= 2 and name[-2] == "/" else name

    def __next__(self) -> Tuple[FastxRecord, ...]:
        """Return the next set of FASTX records from the zipped FASTX files."""
        records = tuple(next(handle, None) for handle in self._fastx)
        if all(record is None for record in records):
            raise StopIteration
        elif any(record is None for record in records):
            sequence_name: str = [record.name for record in records if record is not None][0]
            raise ValueError(
                "One or more of the FASTX files is truncated for sequence "
                + f"{self._name_minus_ordinal(sequence_name)}:\n\t"
                + "\n\t".join(
                    str(self._paths[i]) for i, record in enumerate(records) if record is None
                )
            )
        else:
            record_names: List[str] = [self._name_minus_ordinal(record.name) for record in records]
            if len(set(record_names)) != 1:
                raise ValueError(f"FASTX record names do not all match, found: {record_names}")
            return records

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_val: Optional[BaseException],
        exc_tb: Optional[TracebackType],
    ) -> Optional[bool]:
        """Exit the `FastxZipped` context manager by closing all FASTX files."""
        self.close()
        if exc_type is not None:
            raise exc_type(exc_val).with_traceback(exc_tb)
        return None

    def close(self) -> None:
        """Close the `FastxZipped` context manager by closing all FASTX files."""
        for fastx in self._fastx:
            fastx.close()
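
Because FastxZipped is itself an iterator, it can also be driven manually outside of a with block, so long as close() is called when iteration is done (a sketch using the same hypothetical file names as above):

>>> zipped = FastxZipped("r1.fq", "r2.fq")
>>> (r1, r2) = next(zipped)
>>> zipped.close()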
Functions
__exit__
__exit__(exc_type: Optional[Type[BaseException]], exc_val: Optional[BaseException], exc_tb: Optional[TracebackType]) -> Optional[bool]

Exit the FastxZipped context manager by closing all FASTX files.

Source code in fgpyo/fastx/__init__.py
def __exit__(
    self,
    exc_type: Optional[Type[BaseException]],
    exc_val: Optional[BaseException],
    exc_tb: Optional[TracebackType],
) -> Optional[bool]:
    """Exit the `FastxZipped` context manager by closing all FASTX files."""
    self.close()
    if exc_type is not None:
        raise exc_type(exc_val).with_traceback(exc_tb)
    return None
__init__
__init__(*paths: Union[Path, str], persist: bool = False) -> None

Instantiate a FastxZipped context manager and iterator.

Source code in fgpyo/fastx/__init__.py
def __init__(self, *paths: Union[Path, str], persist: bool = False) -> None:
    """Instantiate a `FastxZipped` context manager and iterator."""
    if len(paths) <= 0:
        raise ValueError(f"Must provide at least one FASTX to {self.__class__.__name__}")
    self._persist: bool = persist
    self._paths: Tuple[Union[Path, str], ...] = paths
    self._fastx = tuple(FastxFile(str(path), persist=self._persist) for path in self._paths)
__next__
__next__() -> Tuple[FastxRecord, ...]

Return the next set of FASTX records from the zipped FASTX files.

Source code in fgpyo/fastx/__init__.py
def __next__(self) -> Tuple[FastxRecord, ...]:
    """Return the next set of FASTX records from the zipped FASTX files."""
    records = tuple(next(handle, None) for handle in self._fastx)
    if all(record is None for record in records):
        raise StopIteration
    elif any(record is None for record in records):
        sequence_name: str = [record.name for record in records if record is not None][0]
        raise ValueError(
            "One or more of the FASTX files is truncated for sequence "
            + f"{self._name_minus_ordinal(sequence_name)}:\n\t"
            + "\n\t".join(
                str(self._paths[i]) for i, record in enumerate(records) if record is None
            )
        )
    else:
        record_names: List[str] = [self._name_minus_ordinal(record.name) for record in records]
        if len(set(record_names)) != 1:
            raise ValueError(f"FASTX record names do not all match, found: {record_names}")
        return records
close
close() -> None

Close the FastxZipped context manager by closing all FASTX files.

Source code in fgpyo/fastx/__init__.py
def close(self) -> None:
    """Close the `FastxZipped` context manager by closing all FASTX files."""
    for fastx in self._fastx:
        fastx.close()

io

Module for reading and writing files

The functions in this module make it easy to:

  • check if a file exists and is writable
  • check if a file and its parent directories exist and are writable
  • check if a file exists and is readable
  • check if a path exists and is a directory
  • open an appropriate reader or writer based on the file extension
  • write items to a file, one per line
  • read lines from a file
fgpyo.io Examples:
>>> import fgpyo.io as fio
>>> from fgpyo.io import write_lines, read_lines
>>> from pathlib import Path

Assert that a path exists and is readable:

>>> tmp_dir = Path(getfixture("tmp_path"))
>>> path_flat: Path = tmp_dir / "example.txt"
>>> fio.assert_path_is_readable(path_flat)  
Traceback (most recent call last):
    ...
AssertionError: Cannot read non-existent path: ...

Write to and read from path:

>>> path_flat = tmp_dir / "example.txt"
>>> path_compressed = tmp_dir / "example.txt.gz"
>>> write_lines(path=path_flat, lines_to_write=["flat file", 10])
>>> write_lines(path=path_compressed, lines_to_write=["gzip file", 10])

Read lines from a path into a generator:

>>> lines = read_lines(path=path_flat)
>>> next(lines)
'flat file'
>>> next(lines)
'10'
>>> lines = read_lines(path=path_compressed)
>>> next(lines)
'gzip file'
>>> next(lines)
'10'

Functions

assert_directory_exists
assert_directory_exists(path: Path) -> None

Asserts that a path exists and is a directory

Parameters:

Name Type Description Default
path Path

Path to check

required
Example

assert_directory_exists(path = Path("/example/directory/"))

Source code in fgpyo/io/__init__.py
def assert_directory_exists(path: Path) -> None:
    """Asserts that a path exist and is a directory

    Args:
        path: Path to check

    Example:
        assert_directory_exists(path = Path("/example/directory/"))
    """
    assert path.exists(), f"Path does not exist: {path}"
    assert path.is_dir(), f"Path exists but is not a directory: {path}"
assert_fasta_indexed
assert_fasta_indexed(fasta: Path, /, dictionary: bool = False, bwa: bool = False) -> None

Verify that a FASTA is readable and has the expected index files.

The existence of the FASTA index generated by samtools faidx will always be verified. The existence of the index files generated by samtools dict and bwa index may be optionally verified.

Parameters:

Name Type Description Default
fasta Path

Path to the FASTA file.

required
dictionary bool

If True, check for the index file generated by samtools dict ({fasta}.dict).

False
bwa bool

If True, check for the index files generated by bwa index ({fasta}.{suffix}, for all suffixes in ["amb", "ann", "bwt", "pac", "sa"]).

False

Raises:

Type Description
AssertionError

If the FASTA or any of the expected index files are missing or not readable.

Source code in fgpyo/io/__init__.py
def assert_fasta_indexed(
    fasta: Path,
    /,
    dictionary: bool = False,
    bwa: bool = False,
) -> None:
    """
    Verify that a FASTA is readable and has the expected index files.

    The existence of the FASTA index generated by `samtools faidx` will always be verified. The
    existence of the index files generated by `samtools dict` and `bwa index` may be optionally
    verified.

    Args:
        fasta: Path to the FASTA file.
        dictionary: If True, check for the index file generated by `samtools dict` (`{fasta}.dict`).
        bwa: If True, check for the index files generated by `bwa index` (`{fasta}.{suffix}`, for
            all suffixes in ["amb", "ann", "bwt", "pac", "sa"]).

    Raises:
        AssertionError: If the FASTA or any of the expected index files are missing or not readable.
    """
    fai_index = Path(f"{fasta}.fai")
    assert_path_is_readable(fai_index)

    if dictionary:
        dict_index = Path(f"{fasta}.dict")
        assert_path_is_readable(dict_index)

    if bwa:
        suffixes = ["amb", "ann", "bwt", "pac", "sa"]
        for suffix in suffixes:
            bwa_index = Path(f"{fasta}.{suffix}")
            assert_path_is_readable(bwa_index)
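
For example, a sketch with a hypothetical reference: the call below passes only if ref.fa.fai and ref.fa.dict are both present and readable:

>>> from pathlib import Path
>>> from fgpyo.io import assert_fasta_indexed
>>> assert_fasta_indexed(Path("ref.fa"), dictionary=True)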
assert_path_is_readable
assert_path_is_readable(path: Path) -> None

Asserts that a path exists, is a file, and is readable, else raises AssertionError

Parameters:

Name Type Description Default
path Path

a Path to check

required
Example

assert_path_is_readable(path = Path("some_file.csv"))

Source code in fgpyo/io/__init__.py
def assert_path_is_readable(path: Path) -> None:
    """Checks that file exists and returns True, else raises AssertionError

    Args:
        path: a Path to check

    Example:
        assert_path_is_readable(path = Path("some_file.csv"))
    """
    # stdin is readable
    if path == Path("/dev/stdin"):
        return

    assert path.exists(), f"Cannot read non-existent path: {path}"
    assert path.is_file(), f"Cannot read path because it is not a file: {path}"
    assert os.access(path, os.R_OK), f"Path exists but is not readable: {path}"
assert_path_is_writable
assert_path_is_writable(path: Path, parent_must_exist: bool = True) -> None

Assert that a filepath is writable.

Specifically:

  • If the file exists then it must also be writable.
  • Else if the path is not a file and parent_must_exist is true, then assert that the parent directory exists and is writable.
  • Else if the path is not a directory and parent_must_exist is false, then look at each parent directory until one is found that exists and is writable.

Parameters:

Name Type Description Default
path Path

Path to check

required
parent_must_exist bool

If True, the file's parent directory must exist. Otherwise, at least one directory in the path's components must exist.

True

Raises:

Type Description
AssertionError

If any of the above conditions are not met.

Example

assert_path_is_writable(path = Path("example.txt"))

Source code in fgpyo/io/__init__.py
def assert_path_is_writable(path: Path, parent_must_exist: bool = True) -> None:
    """
    Assert that a filepath is writable.

    Specifically:
    - If the file exists then it must also be writable.
    - Else if the path is not a file and `parent_must_exist` is true, then assert that the parent
      directory exists and is writable.
    - Else if the path is not a directory and `parent_must_exist` is false, then look at each parent
      directory until one is found that exists and is writable.

    Args:
        path: Path to check
        parent_must_exist: If True, the file's parent directory must exist. Otherwise, at least one
            directory in the path's components must exist.

    Raises:
        AssertionError: If any of the above conditions are not met.

    Example:
        assert_path_is_writable(path = Path("example.txt"))
    """
    # stdout is writable
    if path == Path("/dev/stdout"):
        return

    # If path exists, it must be a writable file
    if path.exists():
        assert path.is_file(), f"Cannot read path because it is not a file: {path}"
        assert os.access(path, os.W_OK), f"File exists but is not writable: {path}"

    # Else if file doesn't exist and parent_must_exist is True then check
    # that path.absolute().parent exists, is a directory and is writable
    elif parent_must_exist:
        parent = path.absolute().parent
        assert parent.exists(), f"Parent directory does not exist: {parent}"
        assert parent.is_dir(), f"Parent directory exists but is not a directory: {parent}"
        assert os.access(parent, os.W_OK), f"Parent directory exists but is not writable: {parent}"

    # Else if file doesn't exist and parent_must_exist is False, test parent until
    # you find the first extant path, and check that it is a directory and is writable.
    else:
        for parent in path.absolute().parents:
            if parent.exists():
                assert os.access(parent, os.W_OK), f"Parent directory is not writable: {parent}"
                break
        else:
            raise AssertionError(f"No parent directories exist for: {path}")
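
As a sketch of the relaxed mode (hypothetical path; assumes a typical Unix system where /tmp exists and is writable), no immediate parent needs to exist so long as some ancestor is a writable directory:

>>> from pathlib import Path
>>> import fgpyo.io as fio
>>> fio.assert_path_is_writable(Path("/tmp/does/not/exist.txt"), parent_must_exist=False)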
assert_path_is_writeable
assert_path_is_writeable(path: Path, parent_must_exist: bool = True) -> None

A deprecated alias for assert_path_is_writable().

Source code in fgpyo/io/__init__.py
def assert_path_is_writeable(path: Path, parent_must_exist: bool = True) -> None:
    """
    A deprecated alias for `assert_path_is_writable()`.
    """
    warnings.warn(
        "assert_path_is_writeable is deprecated, use assert_path_is_writable instead",
        DeprecationWarning,
        stacklevel=2,
    )

    assert_path_is_writable(path=path, parent_must_exist=parent_must_exist)
read_lines
read_lines(path: Path, strip: bool = False, threads: Optional[int] = None) -> Iterator[str]

Takes a path and reads each line into a generator, removing line terminators along the way. By default, only line terminators (CR/LF) are stripped. The strip parameter may be used to strip both leading and trailing whitespace from each line.

Parameters:

Name Type Description Default
path Path

Path to read from

required
strip bool

True to strip lines of all leading and trailing whitespace, False to only remove trailing CR/LF characters.

False
threads Optional[int]

the number of threads to use when decompressing gzip files

None
Example

>>> import fgpyo.io as fio
>>> read_back = fio.read_lines(path)

Source code in fgpyo/io/__init__.py
def read_lines(path: Path, strip: bool = False, threads: Optional[int] = None) -> Iterator[str]:
    """Takes a path and reads each line into a generator, removing line terminators
    along the way. By default, only line terminators (CR/LF) are stripped.  The `strip`
    parameter may be used to strip both leading and trailing whitespace from each line.

    Args:
        path: Path to read from
        strip: True to strip lines of all leading and trailing whitespace,
            False to only remove trailing CR/LF characters.
        threads: the number of threads to use when decompressing gzip files

    Example:
        >>> import fgpyo.io as fio
        >>> read_back = fio.read_lines(path)

    """
    with to_reader(path=path, threads=threads) as reader:
        if strip:
            for line in reader:
                yield line.strip()
        else:
            for line in reader:
                yield line.rstrip("\r\n")
redirect_to_dev_null
redirect_to_dev_null(file_num: int) -> Generator[None, None, None]

A context manager that redirects the output of a file descriptor to /dev/null

Parameters:

Name Type Description Default
file_num int

the number of the file descriptor to redirect.

required
Source code in fgpyo/io/__init__.py
@contextmanager
def redirect_to_dev_null(file_num: int) -> Generator[None, None, None]:
    """A context manager that redirects output of file handle to /dev/null

    Args:
        file_num: the number of the file descriptor to redirect.
    """
    f_devnull = save_fd = None
    try:
        # open /dev/null for writing
        f_devnull = os.open(os.devnull, os.O_RDWR)
        # save the old file descriptor and redirect the given descriptor to /dev/null
        save_fd = os.dup(file_num)
        os.dup2(f_devnull, file_num)
        yield
    finally:
        # restore file descriptor and close devnull
        if save_fd is not None:
            os.dup2(save_fd, file_num)
            os.close(save_fd)
        if f_devnull is not None:
            os.close(f_devnull)
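
A minimal sketch of redirecting standard output at the file-descriptor level (anything printed inside the block goes to /dev/null):

>>> import sys
>>> from fgpyo.io import redirect_to_dev_null
>>> with redirect_to_dev_null(file_num=sys.stdout.fileno()):
...     print("this is discarded")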
suppress_stderr
suppress_stderr() -> Generator[None, None, None]

A context manager that redirects output of stderr to /dev/null

Source code in fgpyo/io/__init__.py
@contextmanager
def suppress_stderr() -> Generator[None, None, None]:
    """A context manager that redirects output of stderr to /dev/null"""
    with redirect_to_dev_null(file_num=sys.stderr.fileno()):
        yield
to_reader
to_reader(path: Path, threads: Optional[int] = None) -> TextIOWrapper

Opens a Path for reading and based on extension uses open() or gzip_ng.open()

Parameters:

Name Type Description Default
path Path

Path to read from

required
threads Optional[int]

the number of threads to use when decompressing gzip files

None
Example

>>> import fgpyo.io as fio
>>> reader = fio.to_reader(path=Path("reader.txt"))
>>> reader.readlines()
>>> reader.close()

Source code in fgpyo/io/__init__.py
def to_reader(path: Path, threads: Optional[int] = None) -> TextIOWrapper:
    """Opens a Path for reading and based on extension uses open() or gzip_ng.open()

    Args:
        path: Path to read from
        threads: the number of threads to use when decompressing gzip files

    Example:
        >>> import fgpyo.io as fio
        >>> reader = fio.to_reader(path=Path("reader.txt"))
        >>> reader.readlines()
        >>> reader.close()

    """
    if path.suffix in COMPRESSED_FILE_EXTENSIONS:
        if threads is None:
            reader = gzip_ng.open(path, mode="rb")  # type: ignore[no-untyped-call]
        else:
            reader = gzip_ng_threaded.open(path, mode="rb", threads=threads)  # type: ignore[no-untyped-call]
        return TextIOWrapper(cast(IO[bytes], reader), encoding="utf-8")
    else:
        return path.open(mode="r")
to_writer
to_writer(path: Path, append: bool = False, threads: Optional[int] = None) -> TextIOWrapper

Opens a Path for writing (or appending) and based on extension uses open() or gzip_ng.open()

Parameters:

Name Type Description Default
path Path

Path to write (or append) to

required
append bool

open the file for appending

False
threads Optional[int]

the number of threads to use when compressing gzip files

None
Example

>>> import fgpyo.io as fio
>>> writer = fio.to_writer(path=Path("writer.txt"))
>>> writer.write("something\n")
>>> writer.close()

Source code in fgpyo/io/__init__.py
def to_writer(path: Path, append: bool = False, threads: Optional[int] = None) -> TextIOWrapper:
    """Opens a Path for writing (or appending) and based on extension uses open() or gzip_ng.open()

    Args:
        path: Path to write (or append) to
        append: open the file for appending
        threads: the number of threads to use when compressing gzip files

    Example:
        >>> import fgpyo.io as fio
        >>> writer = fio.to_writer(path=Path("writer.txt"))
        >>> writer.write("something\\n")
        >>> writer.close()

    """
    mode_prefix: str = "a" if append else "w"

    if path.suffix in COMPRESSED_FILE_EXTENSIONS:
        if threads is None:
            reader = gzip_ng.open(path, mode=mode_prefix + "b")  # type: ignore[no-untyped-call]
        else:
            reader = gzip_ng_threaded.open(path, mode=mode_prefix + "b", threads=threads)  # type: ignore[no-untyped-call]
        return TextIOWrapper(
            cast(IO[bytes], reader),
            encoding="utf-8",
        )
    else:
        # NB: the `cast` here is necessary because `path.open()` may return
        # other types, depending on the specified `mode`.
        # Within the scope of this function, `mode_prefix` is guaranteed to be
        # either "w" or "a", both of which result in a `TextIOWrapper`, but
        # mypy can't follow that logic.
        return cast(TextIOWrapper, path.open(mode=mode_prefix))
write_lines
write_lines(path: Path, lines_to_write: Iterable[Any], append: bool = False, threads: Optional[int] = None) -> None

Writes (or appends) a file with one line per item in provided iterable

Parameters:

Name Type Description Default
path Path

Path to write (or append) to

required
lines_to_write Iterable[Any]

items to write (or append) to file

required
append bool

open the file for appending

False
threads Optional[int]

the number of threads to use when compressing gzip files

None
Example

lines: List[Any] = ["things to write", 100]
path_to_write_to: Path = Path("file_to_write_to.txt")
fio.write_lines(path = path_to_write_to, lines_to_write = lines)

Source code in fgpyo/io/__init__.py
def write_lines(
    path: Path, lines_to_write: Iterable[Any], append: bool = False, threads: Optional[int] = None
) -> None:
    """Writes (or appends) a file with one line per item in provided iterable

    Args:
        path: Path to write (or append) to
        lines_to_write: items to write (or append) to file
        append: open the file for appending
        threads: the number of threads to use when compressing gzip files

    Example:
        lines: List[Any] = ["things to write", 100]
        path_to_write_to: Path = Path("file_to_write_to.txt")
        fio.write_lines(path = path_to_write_to, lines_to_write = lines)
    """
    with to_writer(path=path, append=append, threads=threads) as writer:
        for line in lines_to_write:
            writer.write(str(line))
            writer.write("\n")

platform

Modules

illumina
Methods for working with Illumina-specific UMIs in SAM files

The functions in this module make it easy to:

  • check whether a UMI is valid
  • extract UMI(s) from an Illumina-style read name
  • copy a UMI from an alignment's read name to its RX SAM tag
Attributes
SAM_UMI_DELIMITER module-attribute
SAM_UMI_DELIMITER: str = '-'

Multiple UMI delimiter, which SAM specification recommends should be a hyphen; see specification here: https://samtools.github.io/hts-specs/SAMtags.pdf

Functions
copy_umi_from_read_name
copy_umi_from_read_name(rec: AlignedSegment, strict: bool = False, remove_umi: bool = False) -> bool

Copy a UMI from an alignment's read name to its RX SAM tag. UMI will not be copied to RX tag if invalid.

Parameters:

Name Type Description Default
rec AlignedSegment

The alignment record to update.

required
strict bool

If True and UMI invalid, will throw an exception

False
remove_umi bool

If True, the UMI will be removed from the read name after copying.

False

Returns:

Type Description
bool

True if the UMI was successfully extracted, False otherwise.

Raises:

Type Description
ValueError

If the read name does not end with a valid UMI.

ValueError

If the record already has a populated RX SAM tag.

Source code in fgpyo/platform/illumina.py
def copy_umi_from_read_name(
    rec: AlignedSegment, strict: bool = False, remove_umi: bool = False
) -> bool:
    """
    Copy a UMI from an alignment's read name to its `RX` SAM tag. UMI will not be copied to RX
    tag if invalid.

    Args:
        rec: The alignment record to update.
        strict: If `True` and UMI invalid, will throw an exception
        remove_umi: If `True`, the UMI will be removed from the read name after copying.

    Returns:
        `True` if the UMI was successfully extracted, `False` otherwise.

    Raises:
        ValueError: If the read name does not end with a valid UMI.
        ValueError: If the record already has a populated `RX` SAM tag.
    """

    umi = extract_umis_from_read_name(
        read_name=rec.query_name,
        strict=strict,
        umi_delimiter=_ILLUMINA_READ_NAME_DELIMITER,
    )
    if umi is not None:
        if rec.has_tag("RX"):
            raise ValueError(f"Record {rec.query_name} already has a populated RX tag")
        rec.set_tag(tag="RX", value=umi)
        if remove_umi:
            last_index = rec.query_name.rfind(_ILLUMINA_READ_NAME_DELIMITER)
            rec.query_name = rec.query_name[:last_index] if last_index != -1 else rec.query_name
        return True
    elif strict:
        raise ValueError(f"Invalid UMI {umi} extracted from {rec.query_name}")
    else:
        return False
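
A minimal sketch (hypothetical read name whose final colon-delimited field is the UMI):

>>> import pysam
>>> from fgpyo.platform.illumina import copy_umi_from_read_name
>>> rec = pysam.AlignedSegment()
>>> rec.query_name = "inst:123:FC:1:1:100:200:ACGTACGT"
>>> copy_umi_from_read_name(rec)
True
>>> rec.get_tag("RX")
'ACGTACGT'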
extract_umis_from_read_name
extract_umis_from_read_name(read_name: str, read_name_delimiter: str = _ILLUMINA_READ_NAME_DELIMITER, umi_delimiter: str = _ILLUMINA_UMI_DELIMITER, strict: bool = False) -> Optional[str]

Extract UMI(s) from an Illumina-style read name.

The UMI is expected to be the final component of the read name, delimited by the read_name_delimiter. Multiple UMIs may be present, delimited by the umi_delimiter. This delimiter will be replaced by the SAM-standard -.

Parameters:

Name Type Description Default
read_name str

The read name to extract the UMI from.

required
read_name_delimiter str

The delimiter separating the components of the read name.

_ILLUMINA_READ_NAME_DELIMITER
umi_delimiter str

The delimiter separating multiple UMIs.

_ILLUMINA_UMI_DELIMITER
strict bool

If strict is True, the read name must contain either 7 or 8 colon-separated segments. The UMI is assumed to be the last one in the case of 8 segments and None in the case of 7 segments. strict requires the UMI to be valid and consistent with Illumina's allowed UMI characters. If strict is False, the last segment is returned so long as it appears to be a valid UMI.

False

Returns:

Type Description
Optional[str]

The UMI extracted from the read name, or None if no UMI was found. Multiple UMIs are returned in a single string, separated by a hyphen (-).

Raises:

Type Description
ValueError

If the read name does not end with a valid UMI.

Source code in fgpyo/platform/illumina.py
def extract_umis_from_read_name(
    read_name: str,
    read_name_delimiter: str = _ILLUMINA_READ_NAME_DELIMITER,
    umi_delimiter: str = _ILLUMINA_UMI_DELIMITER,
    strict: bool = False,
) -> Optional[str]:
    """Extract UMI(s) from an Illumina-style read name.

    The UMI is expected to be the final component of the read name, delimited by the
    `read_name_delimiter`. Multiple UMIs may be present, delimited by the `umi_delimiter`. This
    delimiter will be replaced by the SAM-standard `-`.

    Args:
        read_name: The read name to extract the UMI from.
        read_name_delimiter: The delimiter separating the components of the read name.
        umi_delimiter: The delimiter separating multiple UMIs.
        strict: If `strict` is `True`, the read name must contain either 7 or 8 colon-separated
            segments. The UMI is assumed to be the last one in the case of 8 segments and `None`
            in the case of 7 segments. `strict` requires the UMI to be valid and consistent with
            Illumina's allowed UMI characters. If `strict` is `False`, the last segment is returned
            so long as it appears to be a valid UMI.

    Returns:
        The UMI extracted from the read name, or None if no UMI was found. Multiple UMIs are
        returned in a single string, separated by a hyphen (`-`).

    Raises:
        ValueError: If the read name does not end with a valid UMI.
    """
    if strict:
        colons = read_name.count(":")
        if colons == 6:  # number of fields is 7
            return None
        elif colons != 7:
            raise ValueError(
                f"Trying to extract UMIs from read with {colons + 1} parts "
                f"(7 or 8 expected): {read_name}"
            )
    raw_umi = read_name.split(read_name_delimiter)[-1]
    # Check each UMI individually
    umis = raw_umi.split(umi_delimiter)
    # Strip the "r" from rev-comped UMIs
    # (NB: for consistency with UMI_tools, the UMI is not revcomped)
    umis = [umi.lstrip("r") for umi in umis]

    invalid_umis = [umi for umi in umis if not _is_valid_umi(umi)]
    if len(invalid_umis) == 0:
        return SAM_UMI_DELIMITER.join(umis)
    elif strict:
        raise ValueError(
            f"Invalid UMIs found in read name: {read_name}",
            f"  (Invalid UMIs: {', '.join(invalid_umis)})",
        )
    else:
        return None
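
For example, a sketch assuming the default Illumina "+" delimiter between dual UMIs in a hypothetical read name:

>>> from fgpyo.platform.illumina import extract_umis_from_read_name
>>> extract_umis_from_read_name("inst:123:FC:1:1:100:200:ACGT+TTGA")
'ACGT-TTGA'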

read_structure

Classes for representing Read Structures

A Read Structure refers to a String that describes how the bases in a sequencing run should be allocated into logical reads. It serves a similar purpose to the --use-bases-mask option in Illumina's bcl2fastq software, but provides some additional capabilities.

A Read Structure is a sequence of <number><operator> pairs or segments where, optionally, the last segment in the string is allowed to use + instead of a number for its length. The + translates to whatever bases are left after the other segments are processed and can be thought of as meaning [0..infinity].

See more at: https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures

Examples
>>> from fgpyo.read_structure import ReadStructure
>>> rs = ReadStructure.from_string("75T8B75T")
>>> [str(segment) for segment in rs]
['75T', '8B', '75T']
>>> rs[0]
ReadSegment(offset=0, length=75, kind=<SegmentType.Template: 'T'>)
>>> rs = rs.with_variable_last_segment()
>>> [str(segment) for segment in rs]
['75T', '8B', '+T']
>>> rs[-1]
ReadSegment(offset=83, length=None, kind=<SegmentType.Template: 'T'>)
>>> rs = ReadStructure.from_string("1B2M+T")
>>> [s.bases for s in rs.extract("A"*6)]
['A', 'AA', 'AAA']
>>> [s.bases for s in rs.extract("A"*5)]
['A', 'AA', 'AA']
>>> [s.bases for s in rs.extract("A"*4)]
['A', 'AA', 'A']
>>> [s.bases for s in rs.extract("A"*3)]
['A', 'AA', '']
>>> rs.template_segments()
(ReadSegment(offset=3, length=None, kind=<SegmentType.Template: 'T'>),)
>>> [str(segment) for segment in rs.template_segments()]
['+T']
>>> try:
...   ReadStructure.from_string("23T2TT23T")
... except ValueError as ex:
...   print(str(ex))
Read structure missing length information: 23T2T[T]23T

Attributes

ANY_LENGTH_CHAR module-attribute
ANY_LENGTH_CHAR: str = '+'

A character that can be put in place of a number in a read structure to mean "0 or more bases".

Classes

ReadSegment

Encapsulates all the information about a segment within a read structure. A segment can either have a definite length, in which case length must be an int, or an indefinite length (can be any length, 0 or more), in which case length must be None.

Attributes:

Name Type Description
offset int

The offset of the read segment in the read.

length Optional[int]

The length of the segment, or None if it is variable length.

kind SegmentType

The kind of read segment.

Source code in fgpyo/read_structure.py
@attr.s(frozen=True, kw_only=True, auto_attribs=True)
class ReadSegment:
    """Encapsulates all the information about a segment within a read structure. A segment can
    either have a definite length, in which case length must be an int, or an indefinite length
    (can be any length, 0 or more), in which case length must be None.

    Attributes:
        offset: The offset of the read segment in the read.
        length: The length of the segment, or None if it is variable length.
        kind: The kind of read segment.

    """

    offset: int
    length: Optional[int]
    kind: SegmentType

    @property
    def has_fixed_length(self) -> bool:
        """True if the read segment has a defined length."""
        return self.length is not None

    @property
    def fixed_length(self) -> int:
        """The fixed length if there is one. Throws an exception on segments without fixed
        lengths!"""
        if not self.has_fixed_length:
            raise AttributeError(f"fixed_length called on a variable length segment: {self}")
        return self.length

    def extract(self, bases: str) -> SubReadWithoutQuals:
        """Gets the bases associated with this read segment."""
        end = self._calculate_end(bases)
        return SubReadWithoutQuals(bases=bases[self.offset : end], segment=self._resized(end))

    def extract_with_quals(self, bases: str, quals: str) -> SubReadWithQuals:
        """Gets the bases and qualities associated with this read segment."""
        assert len(bases) == len(quals), f"Bases and quals differ in length: {bases} {quals}"
        end = self._calculate_end(bases)
        return SubReadWithQuals(
            bases=bases[self.offset : end],
            quals=quals[self.offset : end],
            segment=self._resized(end),
        )

    def _calculate_end(self, bases: str) -> int:
        """Checks some requirements and then calculates the end position for the segment for the
        given read."""
        bases_len = len(bases)
        assert bases_len >= self.offset, f"Read ends before the segment starts: {self}"
        assert self.length is None or bases_len >= self.offset + self.length, (
            f"Read ends before end of segment: {self}"
        )
        if self.has_fixed_length:
            return min(self.offset + self.fixed_length, bases_len)
        else:
            return bases_len

    def _resized(self, end: int) -> "ReadSegment":
        new_length = end - self.offset
        if self.has_fixed_length and self.fixed_length == new_length:
            return self
        else:
            return attr.evolve(self, length=new_length)

    def __str__(self) -> str:
        if self.has_fixed_length:
            return f"{self.length}{self.kind.value}"
        else:
            return f"{ANY_LENGTH_CHAR}{self.kind.value}"
Attributes
fixed_length property
fixed_length: int

The fixed length if there is one. Throws an exception on segments without fixed lengths!

has_fixed_length property
has_fixed_length: bool

True if the read segment has a defined length.

Functions
extract
extract(bases: str) -> SubReadWithoutQuals

Gets the bases associated with this read segment.

Source code in fgpyo/read_structure.py
def extract(self, bases: str) -> SubReadWithoutQuals:
    """Gets the bases associated with this read segment."""
    end = self._calculate_end(bases)
    return SubReadWithoutQuals(bases=bases[self.offset : end], segment=self._resized(end))
extract_with_quals
extract_with_quals(bases: str, quals: str) -> SubReadWithQuals

Gets the bases and qualities associated with this read segment.

Source code in fgpyo/read_structure.py
def extract_with_quals(self, bases: str, quals: str) -> SubReadWithQuals:
    """Gets the bases and qualities associated with this read segment."""
    assert len(bases) == len(quals), f"Bases and quals differ in length: {bases} {quals}"
    end = self._calculate_end(bases)
    return SubReadWithQuals(
        bases=bases[self.offset : end],
        quals=quals[self.offset : end],
        segment=self._resized(end),
    )
ReadStructure

Bases: Iterable[ReadSegment]

Describes the structure of a given read. A read contains one or more read segments. A read segment describes a contiguous stretch of bases of the same type (ex. template bases) of some length and some offset from the start of the read.

Attributes:

Name Type Description
segments Tuple[ReadSegment, ...]

The segments composing the read structure

Source code in fgpyo/read_structure.py
@attr.s(frozen=True, kw_only=True, auto_attribs=True)
class ReadStructure(Iterable[ReadSegment]):
    """Describes the structure of a give read.  A read contains one or more read segments. A read
    segment describes a contiguous stretch of bases of the same type (ex. template bases) of some
    length and some offset from the start of the read.

    Attributes:
         segments: The segments composing the read structure

    """

    segments: Tuple[ReadSegment, ...]

    @property
    def _min_length(self) -> int:
        """The minimum length read that this read structure can process"""
        return sum(segment.length for segment in self.segments if segment.has_fixed_length)

    @property
    def has_fixed_length(self) -> bool:
        """True if the ReadStructure has a fixed (i.e. non-variable) length"""
        return self.segments[-1].has_fixed_length

    @property
    def fixed_length(self) -> int:
        """The fixed length if there is one. Throws an exception on segments without fixed
        lengths!"""
        if not self.has_fixed_length:
            raise AttributeError(f"fixed_length called on a variable length read structure: {self}")
        return self._min_length

    @property
    def length(self) -> int:
        """Length is defined as the number of segments (not bases!) in the read structure"""
        return len(self.segments)

    def with_variable_last_segment(self) -> "ReadStructure":
        """Generates a new ReadStructure that is the same as this one except that the last segment
        has undefined length"""
        last_segment = self.segments[-1]
        if not last_segment.has_fixed_length:
            return self
        else:
            last_segment = attr.evolve(last_segment, length=None)
            return ReadStructure(segments=self.segments[:-1] + (last_segment,))

    def extract(self, bases: str) -> Tuple[SubReadWithoutQuals, ...]:
        """Splits the given bases into tuples with its associated read segment."""
        return tuple([segment.extract(bases=bases) for segment in self])

    def extract_with_quals(self, bases: str, quals: str) -> Tuple[SubReadWithQuals, ...]:
        """Splits the given bases and qualities into triples with its associated read segment."""
        return tuple([segment.extract_with_quals(bases=bases, quals=quals) for segment in self])

    def segments_by_kind(self, kind: SegmentType) -> Tuple[ReadSegment, ...]:
        """Returns just the segments of a given kind."""
        return tuple([segment for segment in self if segment.kind == kind])

    def template_segments(self) -> Tuple[ReadSegment, ...]:
        return self.segments_by_kind(kind=SegmentType.Template)

    def sample_barcode_segments(self) -> Tuple[ReadSegment, ...]:
        return self.segments_by_kind(kind=SegmentType.SampleBarcode)

    def molecular_barcode_segments(self) -> Tuple[ReadSegment, ...]:
        return self.segments_by_kind(kind=SegmentType.MolecularBarcode)

    def cell_barcode_segments(self) -> Tuple[ReadSegment, ...]:
        return self.segments_by_kind(kind=SegmentType.CellBarcode)

    def skip_segments(self) -> Tuple[ReadSegment, ...]:
        return self.segments_by_kind(kind=SegmentType.Skip)

    def __iter__(self) -> Iterator[ReadSegment]:
        return iter(self.segments)

    def __str__(self) -> str:
        return "".join(str(s) for s in self.segments)

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, index: int) -> ReadSegment:
        return self.segments[index]

    @classmethod
    def from_segments(
        cls, segments: Tuple[ReadSegment, ...], reset_offsets: bool = False
    ) -> "ReadStructure":
        """Creates a new ReadStructure, optionally resetting the offsets on each of the segments"""
        # Check that none but the last segment has an indefinite length
        assert all(s.has_fixed_length for s in segments[:-1]), (
            f"Variable length ({ANY_LENGTH_CHAR}) can only be used in the last segment: "
            + "".join(str(s) for s in segments)
        )

        if reset_offsets:
            off = 0
            segs = []
            for seg in segments:
                seg = attr.evolve(seg, offset=off)
                off += seg.length if seg.has_fixed_length else 0
                segs.append(seg)
            segments = tuple(segs)

        assert all(s.length is None or s.length > 0 for s in segments), (
            "Read structure contained zero length segments" + "".join(str(s) for s in segments)
        )

        return ReadStructure(segments=segments)

    @classmethod
    def from_string(cls, segments: str) -> "ReadStructure":
        # Check that none but the last segment has an indefinite length
        tidied = "".join(ch for ch in segments.upper() if not ch.isspace())
        return cls.from_segments(segments=cls._from_string(string=tidied), reset_offsets=True)

    @classmethod
    def _from_string(cls, string: str) -> Tuple[ReadSegment, ...]:
        index = 0
        segments: List[ReadSegment] = []
        while index < len(string):
            # stash the beginning position of our parsing so we can highlight what we're having
            # trouble with
            parse_index = index

            seg_length: Optional[int] = None
            # Parse out the length segment which may be 1 or more digits or the AnyLengthChar
            if string[index] == ANY_LENGTH_CHAR:
                index += 1
                seg_length = None
            elif string[index].isdigit():
                seg_length = 0
                while index < len(string) and string[index].isdigit():
                    seg_length = (seg_length * 10) + int(string[index])
                    index += 1
            else:
                cls._invalid(
                    msg="Read structure missing length information",
                    rs=string,
                    start=parse_index,
                    end=parse_index + 1,
                )

            # Parse out the operator and make a segment
            if index == len(string):
                cls._invalid(
                    msg="Read structure with invalid segment",
                    rs=string,
                    start=parse_index,
                    end=index,
                )
            code = string[index]
            index += 1
            kind: SegmentType
            try:
                kind = SegmentType(code)
            except ValueError:
                cls._invalid(
                    msg="Read structure segment had unknown type",
                    rs=string,
                    start=parse_index,
                    end=parse_index + 1,
                )
            segments.append(ReadSegment(offset=0, length=seg_length, kind=kind))

        return tuple(segments)

    @classmethod
    def _invalid(cls, msg: str, rs: str, start: int, end: int) -> None:
        """Inserts square brackets around the characters in the read structure that are causing the
        error."""
        prefix = rs[:start]
        error = rs[start:end]
        suffix = "" if end == len(rs) else rs[end:]
        raise ValueError(f"{msg}: {prefix}[{error}]{suffix}")
Attributes
fixed_length property
fixed_length: int

The fixed length if there is one. Throws an exception on segments without fixed lengths!

has_fixed_length property
has_fixed_length: bool

True if the ReadStructure has a fixed (i.e. non-variable) length

length property
length: int

Length is defined as the number of segments (not bases!) in the read structure

Functions
extract
extract(bases: str) -> Tuple[SubReadWithoutQuals, ...]

Splits the given bases into tuples with its associated read segment.

Source code in fgpyo/read_structure.py
def extract(self, bases: str) -> Tuple[SubReadWithoutQuals, ...]:
    """Splits the given bases into tuples with its associated read segment."""
    return tuple([segment.extract(bases=bases) for segment in self])
extract_with_quals
extract_with_quals(bases: str, quals: str) -> Tuple[SubReadWithQuals, ...]

Splits the given bases and qualities into triples with its associated read segment.

Source code in fgpyo/read_structure.py
def extract_with_quals(self, bases: str, quals: str) -> Tuple[SubReadWithQuals, ...]:
    """Splits the given bases and qualities into triples with its associated read segment."""
    return tuple([segment.extract_with_quals(bases=bases, quals=quals) for segment in self])
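
For example, a sketch splitting bases and matching qualities with a two-segment structure:

>>> from fgpyo.read_structure import ReadStructure
>>> subreads = ReadStructure.from_string("3M+T").extract_with_quals("ACGTTTT", "IIIIIII")
>>> [(s.bases, s.quals) for s in subreads]
[('ACG', 'III'), ('TTTT', 'IIII')]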
from_segments classmethod
from_segments(segments: Tuple[ReadSegment, ...], reset_offsets: bool = False) -> ReadStructure

Creates a new ReadStructure, optionally resetting the offsets on each of the segments

Source code in fgpyo/read_structure.py
@classmethod
def from_segments(
    cls, segments: Tuple[ReadSegment, ...], reset_offsets: bool = False
) -> "ReadStructure":
    """Creates a new ReadStructure, optionally resetting the offsets on each of the segments"""
    # Check that none but the last segment has an indefinite length
    assert all(s.has_fixed_length for s in segments[:-1]), (
        f"Variable length ({ANY_LENGTH_CHAR}) can only be used in the last segment: "
        + "".join(str(s) for s in segments)
    )

    if reset_offsets:
        off = 0
        segs = []
        for seg in segments:
            seg = attr.evolve(seg, offset=off)
            off += seg.length if seg.has_fixed_length else 0
            segs.append(seg)
        segments = tuple(segs)

    assert all(s.length is None or s.length > 0 for s in segments), (
        "Read structure contained zero length segments" + "".join(str(s) for s in segments)
    )

    return ReadStructure(segments=segments)
segments_by_kind
segments_by_kind(kind: SegmentType) -> Tuple[ReadSegment, ...]

Returns just the segments of a given kind.

Source code in fgpyo/read_structure.py
def segments_by_kind(self, kind: SegmentType) -> Tuple[ReadSegment, ...]:
    """Returns just the segments of a given kind."""
    return tuple([segment for segment in self if segment.kind == kind])
with_variable_last_segment
with_variable_last_segment() -> ReadStructure

Generates a new ReadStructure that is the same as this one except that the last segment has undefined length

Source code in fgpyo/read_structure.py
def with_variable_last_segment(self) -> "ReadStructure":
    """Generates a new ReadStructure that is the same as this one except that the last segment
    has undefined length"""
    last_segment = self.segments[-1]
    if not last_segment.has_fixed_length:
        return self
    else:
        last_segment = attr.evolve(last_segment, length=None)
        return ReadStructure(segments=self.segments[:-1] + (last_segment,))
SegmentType

Bases: Enum

The type of segments that can show up in a read structure

Source code in fgpyo/read_structure.py
@enum.unique
class SegmentType(enum.Enum):
    """The type of segments that can show up in a read structure"""

    Template = "T"
    """The segment type for template bases."""

    SampleBarcode = "B"
    """The segment type for sample barcode bases."""

    MolecularBarcode = "M"
    """The segment type for molecular barcode bases."""

    CellBarcode = "C"
    """The segment type for cell barcode bases."""

    Skip = "S"
    """The segment type for bases that need to be skipped."""

    def __str__(self) -> str:
        return self.value
Attributes
CellBarcode class-attribute instance-attribute
CellBarcode = 'C'

The segment type for cell barcode bases.

MolecularBarcode class-attribute instance-attribute
MolecularBarcode = 'M'

The segment type for molecular barcode bases.

SampleBarcode class-attribute instance-attribute
SampleBarcode = 'B'

The segment type for sample barcode bases.

Skip class-attribute instance-attribute
Skip = 'S'

The segment type for bases that need to be skipped.

Template class-attribute instance-attribute
Template = 'T'

The segment type for template bases.

SubReadWithQuals

Contains the bases and qualities that correspond to the given read segment

Source code in fgpyo/read_structure.py
@attr.s(frozen=True, kw_only=True, auto_attribs=True)
class SubReadWithQuals:
    """Contains the bases and qualities that correspond to the given read segment"""

    bases: str
    """The sub-read bases that correspond to the given read segment."""

    quals: str
    """The sub-read base qualities that correspond to the given read segment."""

    segment: "ReadSegment"
    """The segment of the read structure that describes this sub-read."""

    @property
    def kind(self) -> SegmentType:
        """The kind of read segment that corresponds to this sub-read."""
        return self.segment.kind
Attributes
bases instance-attribute
bases: str

The sub-read bases that correspond to the given read segment.

kind property

The kind of read segment that corresponds to this sub-read.

quals instance-attribute
quals: str

The sub-read base qualities that correspond to the given read segment.

segment instance-attribute
segment: ReadSegment

The segment of the read structure that describes this sub-read.

SubReadWithoutQuals

Contains the bases that correspond to the given read segment.

Source code in fgpyo/read_structure.py
@attr.s(frozen=True, kw_only=True, auto_attribs=True)
class SubReadWithoutQuals:
    """Contains the bases that correspond to the given read segment."""

    bases: str
    """The sub-read bases that correspond to the given read segment."""

    segment: "ReadSegment"
    """The segment of the read structure that describes this sub-read."""

    @property
    def kind(self) -> SegmentType:
        """The kind of read segment that corresponds to this sub-read."""
        return self.segment.kind
Attributes
bases instance-attribute
bases: str

The sub-read bases that correspond to the given read segment.

kind property

The kind of read segment that corresponds to this sub-read.

segment instance-attribute
segment: ReadSegment

The segment of the read structure that describes this sub-read.

sam

Utility Classes and Methods for SAM/BAM

This module contains utility classes for working with SAM/BAM files and the data contained within them. This includes i) utilities for opening SAM/BAM files for reading and writing, ii) functions for manipulating supplementary alignments, iii) classes and functions for manipulating CIGAR strings, and iv) a class for building SAM records and files for testing.

Motivation for Reader and Writer methods

The following are the reasons for choosing to implement methods to open a SAM/BAM file for reading and writing, rather than relying on pysam.AlignmentFile directly:

  1. Provides a centralized place for the implementation of opening a SAM/BAM for reading and writing. This is useful if any additional parameters are added, or changes to standards or defaults are made.
  2. Makes the requirement to provide a header when opening a file for writing more explicit.
  3. Adds support for pathlib.Path.
  4. Removes the reliance on specifying the mode correctly, including specifying the file type (i.e. SAM, BAM, or CRAM), as well as additional options (ex. compression level). This makes the code more explicit and easier to read.
  5. Performs an explicit check to ensure the file type is specified when writing using a file-like object rather than a path to a file.
Examples of Opening a SAM/BAM for Reading or Writing

Opening a SAM/BAM file for reading, auto-recognizing the file-type by the file extension. See SamFileType() for the supported file types.

>>> from fgpyo.sam import reader
>>> with reader("/path/to/sample.sam") as fh:  
...     for record in fh:
...         print(record.query_name)  # do something
>>> with reader("/path/to/sample.bam") as fh:  
...     for record in fh:
...         print(record.query_name)  # do something

Opening a SAM/BAM file for reading, explicitly passing the file type.

>>> from fgpyo.sam import SamFileType
>>> with reader(path="/path/to/sample.ext1", file_type=SamFileType.SAM) as fh:  
...     for record in fh:
...         print(record.query_name)  # do something
>>> with reader(path="/path/to/sample.ext2", file_type=SamFileType.BAM) as fh:  
...     for record in fh:
...         print(record.query_name)  # do something

Opening a SAM/BAM file for reading, using an existing file-like object

>>> with open("/path/to/sample.sam", "rb") as file_object:  
...     with reader(path=file_object, file_type=SamFileType.BAM) as fh:
...         for record in fh:
...             print(record.query_name)  # do something

Opening a SAM/BAM file for writing is similar to the reader() method, but the SAM file header object is required.

>>> from fgpyo.sam import writer
>>> header: Dict[str, Any] = {
...     "HD": {"VN": "1.5", "SO": "coordinate"},
...     "RG": [{"ID": "1", "SM": "1_AAAAAA", "LB": "lib", "PL": "ILLUMINA", "PU": "xxx.1"}],
...     "SQ":  [
...         {"SN": "chr1", "LN": 249250621},
...         {"SN": "chr2", "LN": 243199373}
...     ]
... }  
>>> with writer(path="/path/to/sample.bam", header=header) as fh:  
...     pass  # do something
Examples of Manipulating Cigars

Creating a Cigar from a pysam.AlignedSegment.

>>> from fgpyo.sam import Cigar
>>> with reader("/path/to/sample.sam") as fh:  
...     record = next(fh)
...     cigar = Cigar.from_cigartuples(record.cigartuples)
...     print(str(cigar))
50M2D5M10S

Creating a Cigar from a str().

>>> cigar = Cigar.from_cigarstring("50M2D5M10S")
>>> print(str(cigar))
50M2D5M10S

If the cigar string is invalid, the exception message will show you the problem character(s) in square brackets.

>>> cigar = Cigar.from_cigarstring("10M5U")
Traceback (most recent call last):
    ...
fgpyo.sam.CigarParsingException: Malformed cigar: 10M5[U]

The cigar contains a tuple of CigarElement()s. Each element contains the cigar operator (CigarOp()) and associated operator length. A number of useful methods are part of both classes.

The number of bases aligned on the query (i.e. the number of bases consumed by the cigar from the query):

>>> cigar = Cigar.from_cigarstring("50M2D5M2I10S")
>>> [e.length_on_query for e in cigar.elements]
[50, 0, 5, 2, 10]
>>> [e.length_on_target for e in cigar.elements]
[50, 2, 5, 0, 0]
>>> [e.operator.is_indel for e in cigar.elements]
[False, True, False, True, False]

Any particular element can be accessed directly via .elements with its index (and works with negative indexes and slices):

>>> cigar = Cigar.from_cigarstring("50M2D5M2I10S")
>>> cigar.elements[0].length
50
>>> cigar.elements[1].operator
<CigarOp.D: (2, 'D', False, True)>
>>> cigar.elements[-1].operator
<CigarOp.S: (4, 'S', True, False)>
>>> tuple(x.operator.character for x in cigar.elements[1:3])
('D', 'M')
>>> tuple(x.operator.character for x in cigar.elements[-2:])
('I', 'S')
Examples of parsing the SA tag and individual supplementary alignments
>>> from fgpyo.sam import SupplementaryAlignment
>>> sup = SupplementaryAlignment.parse("chr1,123,+,50S100M,60,0")
>>> sup.reference_name
'chr1'
>>> sup.nm
0
>>> from typing import List
>>> sa_tag = "chr1,123,+,50S100M,60,0;chr2,456,-,75S75M,60,1"
>>> sups: List[SupplementaryAlignment] = SupplementaryAlignment.parse_sa_tag(tag=sa_tag)
>>> len(sups)
2
>>> [str(sup.cigar) for sup in sups]
['50S100M', '75S75M']

Attributes

DefaultProperlyPairedOrientations module-attribute
DefaultProperlyPairedOrientations: set[PairOrientation] = {FR}

The default orientations for properly paired reads.

NO_QUERY_QUALITIES module-attribute
NO_QUERY_QUALITIES: array = qualitystring_to_array(STRING_PLACEHOLDER)

The quality array corresponding to an unavailable query quality string ("*").

NO_REF_INDEX module-attribute
NO_REF_INDEX: int = -1

The reference index to use to indicate no reference in SAM/BAM.

NO_REF_NAME module-attribute
NO_REF_NAME: str = STRING_PLACEHOLDER

The reference name to use to indicate no reference in SAM/BAM.

NO_REF_POS module-attribute
NO_REF_POS: int = -1

The reference position to use to indicate no position in SAM/BAM.

STRING_PLACEHOLDER module-attribute
STRING_PLACEHOLDER: str = '*'

The value to use when a string field's information is unavailable.

SamPath module-attribute
SamPath = Union[IO[Any], Path, str]

The valid base classes for opening a SAM/BAM/CRAM file.

Classes

Cigar

Class representing a cigar string.

Attributes:

Name Type Description
elements Tuple[CigarElement, ...]

zero or more cigar elements

Source code in fgpyo/sam/__init__.py
@attr.s(frozen=True, slots=True, auto_attribs=True)
class Cigar:
    """Class representing a cigar string.

    Attributes:
        - elements (Tuple[CigarElement, ...]): zero or more cigar elements
    """

    elements: Tuple[CigarElement, ...] = ()

    @classmethod
    def from_cigartuples(cls, cigartuples: Optional[List[Tuple[int, int]]]) -> "Cigar":
        """Returns a Cigar from a list of tuples returned by pysam.

        Each tuple denotes the operation and length.  See
        [`CigarOp()`][fgpyo.sam.CigarOp] for more information on the
        various operators.  If None is given, returns an empty Cigar.
        """
        if cigartuples is None or cigartuples == []:
            return Cigar()
        try:
            elements = []
            for code, length in cigartuples:
                operator = CigarOp.from_code(code)
                elements.append(CigarElement(length, operator))
            return Cigar(tuple(elements))
        except Exception as ex:
            raise CigarParsingException(f"Malformed cigar tuples: {cigartuples}") from ex

    @classmethod
    def _pretty_cigarstring_exception(cls, cigarstring: str, index: int) -> CigarParsingException:
        """Raises an exception highlighting the malformed character"""
        prefix = cigarstring[:index]
        character = cigarstring[index] if index < len(cigarstring) else ""
        suffix = cigarstring[index + 1 :]
        pretty_cigarstring = f"{prefix}[{character}]{suffix}"
        message = f"Malformed cigar: {pretty_cigarstring}"
        return CigarParsingException(message)

    @classmethod
    def from_cigarstring(cls, cigarstring: str) -> "Cigar":
        """Constructs a Cigar from a string returned by pysam.

        If "*" is given, returns an empty Cigar.
        """
        if cigarstring == "*":
            return Cigar()

        cigarstring_length = len(cigarstring)
        if cigarstring_length == 0:
            raise CigarParsingException("Cigar string was empty")

        elements = []
        i = 0
        while i < cigarstring_length:
            if not cigarstring[i].isdigit():
                raise cls._pretty_cigarstring_exception(cigarstring, i)
            length = int(cigarstring[i])
            i += 1
            while i < cigarstring_length and cigarstring[i].isdigit():
                length = (length * 10) + int(cigarstring[i])
                i += 1
            if i == cigarstring_length:
                raise cls._pretty_cigarstring_exception(cigarstring, i)
            try:
                operator = CigarOp.from_character(cigarstring[i])
                elements.append(CigarElement(length, operator))
            except KeyError as ex:
                # cigar operator was not valid
                raise cls._pretty_cigarstring_exception(cigarstring, i) from ex
            except IndexError as ex:
                # missing cigar operator (i == len(cigarstring))
                raise cls._pretty_cigarstring_exception(cigarstring, i) from ex
            i += 1
        return Cigar(tuple(elements))

    def __str__(self) -> str:
        if self.elements:
            return "".join([str(e) for e in self.elements])
        else:
            return "*"

    def reversed(self) -> "Cigar":
        """Returns a copy of the Cigar with the elements in reverse order."""
        return Cigar(tuple(reversed(self.elements)))

    def length_on_query(self) -> int:
        """Returns the length of the alignment on the query sequence."""
        return sum([elem.length_on_query for elem in self.elements])

    def length_on_target(self) -> int:
        """Returns the length of the alignment on the target sequence."""
        return sum([elem.length_on_target for elem in self.elements])

    def query_alignment_offsets(self) -> Tuple[int, int]:
        """
        Gets the 0-based, end-exclusive positions of the first and last aligned base in the query.

        The resulting range will contain the range of positions in the SEQ string for
        the bases that are aligned.
        If counting from the end of the query is desired, use
        `cigar.reversed().query_alignment_offsets()`

        Returns:
            A tuple (start, stop) containing the start and stop positions
                of the aligned part of the query. These offsets are 0-based and open-ended, with
                respect to the beginning of the query.

        Raises:
            ValueError: If according to the cigar, there are no aligned query bases.
        """
        start_offset: int = 0
        end_offset: int = 0
        element: CigarElement
        alignment_began = False
        for element in self.elements:
            if element.operator.is_clipping and not alignment_began:
                # We are in the clipping operators preceding the alignment
                # Note: hardclips have length-on-query=0
                start_offset += element.length_on_query
                end_offset += element.length_on_query
            elif not element.operator.is_clipping:
                # We are within the alignment
                alignment_began = True
                end_offset += element.length_on_query
            else:
                # We have exited the alignment and are in the clipping operators after the alignment
                break

        if start_offset == end_offset:
            raise ValueError(f"Cigar {self} has no aligned bases")
        return start_offset, end_offset
Functions
from_cigarstring classmethod
from_cigarstring(cigarstring: str) -> Cigar

Constructs a Cigar from a string returned by pysam.

If "*" is given, returns an empty Cigar.

Source code in fgpyo/sam/__init__.py
@classmethod
def from_cigarstring(cls, cigarstring: str) -> "Cigar":
    """Constructs a Cigar from a string returned by pysam.

    If "*" is given, returns an empty Cigar.
    """
    if cigarstring == "*":
        return Cigar()

    cigarstring_length = len(cigarstring)
    if cigarstring_length == 0:
        raise CigarParsingException("Cigar string was empty")

    elements = []
    i = 0
    while i < cigarstring_length:
        if not cigarstring[i].isdigit():
            raise cls._pretty_cigarstring_exception(cigarstring, i)
        length = int(cigarstring[i])
        i += 1
        while i < cigarstring_length and cigarstring[i].isdigit():
            length = (length * 10) + int(cigarstring[i])
            i += 1
        if i == cigarstring_length:
            raise cls._pretty_cigarstring_exception(cigarstring, i)
        try:
            operator = CigarOp.from_character(cigarstring[i])
            elements.append(CigarElement(length, operator))
        except KeyError as ex:
            # cigar operator was not valid
            raise cls._pretty_cigarstring_exception(cigarstring, i) from ex
        except IndexError as ex:
            # missing cigar operator (i == len(cigarstring))
            raise cls._pretty_cigarstring_exception(cigarstring, i) from ex
        i += 1
    return Cigar(tuple(elements))
from_cigartuples classmethod
from_cigartuples(cigartuples: Optional[List[Tuple[int, int]]]) -> Cigar

Returns a Cigar from a list of tuples returned by pysam.

Each tuple denotes the operation and length. See CigarOp() for more information on the various operators. If None is given, returns an empty Cigar.

Source code in fgpyo/sam/__init__.py
@classmethod
def from_cigartuples(cls, cigartuples: Optional[List[Tuple[int, int]]]) -> "Cigar":
    """Returns a Cigar from a list of tuples returned by pysam.

    Each tuple denotes the operation and length.  See
    [`CigarOp()`][fgpyo.sam.CigarOp] for more information on the
    various operators.  If None is given, returns an empty Cigar.
    """
    if cigartuples is None or cigartuples == []:
        return Cigar()
    try:
        elements = []
        for code, length in cigartuples:
            operator = CigarOp.from_code(code)
            elements.append(CigarElement(length, operator))
        return Cigar(tuple(elements))
    except Exception as ex:
        raise CigarParsingException(f"Malformed cigar tuples: {cigartuples}") from ex
length_on_query
length_on_query() -> int

Returns the length of the alignment on the query sequence.

Source code in fgpyo/sam/__init__.py
def length_on_query(self) -> int:
    """Returns the length of the alignment on the query sequence."""
    return sum([elem.length_on_query for elem in self.elements])
length_on_target
length_on_target() -> int

Returns the length of the alignment on the target sequence.

Source code in fgpyo/sam/__init__.py
def length_on_target(self) -> int:
    """Returns the length of the alignment on the target sequence."""
    return sum([elem.length_on_target for elem in self.elements])
query_alignment_offsets
query_alignment_offsets() -> Tuple[int, int]

Gets the 0-based, end-exclusive positions of the first and last aligned base in the query.

The resulting range will contain the range of positions in the SEQ string for the bases that are aligned. If counting from the end of the query is desired, use cigar.reversed().query_alignment_offsets()

Returns:

Type Description
Tuple[int, int]

A tuple (start, stop) containing the start and stop positions of the aligned part of the query. These offsets are 0-based and open-ended, with respect to the beginning of the query.

Raises:

Type Description
ValueError

If according to the cigar, there are no aligned query bases.

Source code in fgpyo/sam/__init__.py
def query_alignment_offsets(self) -> Tuple[int, int]:
    """
    Gets the 0-based, end-exclusive positions of the first and last aligned base in the query.

    The resulting range will contain the range of positions in the SEQ string for
    the bases that are aligned.
    If counting from the end of the query is desired, use
    `cigar.reversed().query_alignment_offsets()`

    Returns:
        A tuple (start, stop) containing the start and stop positions
            of the aligned part of the query. These offsets are 0-based and open-ended, with
            respect to the beginning of the query.

    Raises:
        ValueError: If according to the cigar, there are no aligned query bases.
    """
    start_offset: int = 0
    end_offset: int = 0
    element: CigarElement
    alignment_began = False
    for element in self.elements:
        if element.operator.is_clipping and not alignment_began:
            # We are in the clipping operators preceding the alignment
            # Note: hardclips have length-on-query=0
            start_offset += element.length_on_query
            end_offset += element.length_on_query
        elif not element.operator.is_clipping:
            # We are within the alignment
            alignment_began = True
            end_offset += element.length_on_query
        else:
            # We have exited the alignment and are in the clipping operators after the alignment
            break

    if start_offset == end_offset:
        raise ValueError(f"Cigar {self} has no aligned bases")
    return start_offset, end_offset
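
For example, with soft-clips on both ends (using reversed() to count offsets from the end of the query):

>>> Cigar.from_cigarstring("5S10M2S").query_alignment_offsets()
(5, 15)
>>> Cigar.from_cigarstring("5S10M2S").reversed().query_alignment_offsets()
(2, 12)
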
reversed
reversed() -> Cigar

Returns a copy of the Cigar with the elements in reverse order.

Source code in fgpyo/sam/__init__.py
def reversed(self) -> "Cigar":
    """Returns a copy of the Cigar with the elements in reverse order."""
    return Cigar(tuple(reversed(self.elements)))
CigarElement

Represents an element in a Cigar

Attributes:

Name Type Description
length int

the length of the element

operator CigarOp

the operator of the element

Source code in fgpyo/sam/__init__.py
@attr.s(frozen=True, slots=True, auto_attribs=True)
class CigarElement:
    """Represents an element in a Cigar

    Attributes:
        - length (int): the length of the element
        - operator (CigarOp): the operator of the element
    """

    length: int
    operator: CigarOp

    def __attrs_post_init__(self) -> None:
        """Validates the length attribute is greater than zero."""
        if self.length <= 0:
            raise ValueError(f"Cigar element must have a length > 0, found {self.length}")

    @property
    def length_on_query(self) -> int:
        """Returns the length of the element on the query sequence."""
        return self.length if self.operator.consumes_query else 0

    @property
    def length_on_target(self) -> int:
        """Returns the length of the element on the target (often reference) sequence."""
        return self.length if self.operator.consumes_reference else 0

    def __str__(self) -> str:
        return f"{self.length}{self.operator.character}"
Attributes
length_on_query property
length_on_query: int

Returns the length of the element on the query sequence.

length_on_target property
length_on_target: int

Returns the length of the element on the target (often reference) sequence.

Functions
__attrs_post_init__
__attrs_post_init__() -> None

Validates the length attribute is greater than zero.

Source code in fgpyo/sam/__init__.py
def __attrs_post_init__(self) -> None:
    """Validates the length attribute is greater than zero."""
    if self.length <= 0:
        raise ValueError(f"Cigar element must have a length > 0, found {self.length}")
CigarOp

Bases: Enum

Enumeration of operators that can appear in a Cigar string.

Attributes:

Name Type Description
code int

The pysam cigar operator code.

character str

The single character cigar operator.

consumes_query bool

True if this operator consumes query bases, False otherwise.

consumes_reference bool

True if this operator consumes target bases, False otherwise.

Source code in fgpyo/sam/__init__.py
@enum.unique
class CigarOp(enum.Enum):
    """Enumeration of operators that can appear in a Cigar string.

    Attributes:
        code (int): The `~pysam` cigar operator code.
        character (str): The single character cigar operator.
        consumes_query (bool): True if this operator consumes query bases, False otherwise.
        consumes_reference (bool): True if this operator consumes reference bases, False otherwise.
    """

    M = (0, "M", True, True)  #: Match or Mismatch the reference
    I = (1, "I", True, False)  #: Insertion versus the reference  # noqa: E741
    D = (2, "D", False, True)  #: Deletion versus the reference
    N = (3, "N", False, True)  #: Skipped region from the reference
    S = (4, "S", True, False)  #: Soft clip
    H = (5, "H", False, False)  #: Hard clip
    P = (6, "P", False, False)  #: Padding
    EQ = (7, "=", True, True)  #: Matches the reference
    X = (8, "X", True, True)  #: Mismatches the reference

    def __init__(
        self, code: int, character: str, consumes_query: bool, consumes_reference: bool
    ) -> None:
        self.code = code
        self.character = character
        self.consumes_query = consumes_query
        self.consumes_reference = consumes_reference

    @staticmethod
    def from_character(character: str) -> "CigarOp":
        """Returns the operator from the single character."""
        if CigarOp.EQ.character == character:
            return CigarOp.EQ
        else:
            return CigarOp[character]

    @staticmethod
    def from_code(code: int) -> "CigarOp":
        """Returns the operator from the given operator code.

        Note: this is mainly used to get the operator from :py:mod:`~pysam`.
        """
        return CigarOp[_CigarOpUtil.CODE_TO_CHARACTER[code]]

    @property
    def is_indel(self) -> bool:
        """Returns true if the operator is an indel, false otherwise."""
        return self == CigarOp.I or self == CigarOp.D

    @property
    def is_clipping(self) -> bool:
        """Returns true if the operator is a soft/hard clip, false otherwise."""
        return self == CigarOp.S or self == CigarOp.H
Attributes
is_clipping property
is_clipping: bool

Returns true if the operator is a soft/hard clip, false otherwise.

is_indel property
is_indel: bool

Returns true if the operator is an indel, false otherwise.

Functions
from_character staticmethod
from_character(character: str) -> CigarOp

Returns the operator from the single character.

Source code in fgpyo/sam/__init__.py
@staticmethod
def from_character(character: str) -> "CigarOp":
    """Returns the operator from the single character."""
    if CigarOp.EQ.character == character:
        return CigarOp.EQ
    else:
        return CigarOp[character]
from_code staticmethod
from_code(code: int) -> CigarOp

Returns the operator from the given operator code.

Note: this is mainly used to get the operator from pysam.

Source code in fgpyo/sam/__init__.py
@staticmethod
def from_code(code: int) -> "CigarOp":
    """Returns the operator from the given operator code.

    Note: this is mainly used to get the operator from :py:mod:`~pysam`.
    """
    return CigarOp[_CigarOpUtil.CODE_TO_CHARACTER[code]]
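
For example:

>>> from fgpyo.sam import CigarOp
>>> CigarOp.from_character("=")
<CigarOp.EQ: (7, '=', True, True)>
>>> CigarOp.from_code(2).character
'D'
>>> CigarOp.S.is_clipping
True
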
CigarParsingException

Bases: Exception

The exception raised for errors specific to parsing a cigar.

Source code in fgpyo/sam/__init__.py
class CigarParsingException(Exception):
    """The exception raised specific to parsing a cigar."""

    pass
PairOrientation

Bases: Enum

Enumerations of read pair orientations.

Source code in fgpyo/sam/__init__.py
@enum.unique
class PairOrientation(enum.Enum):
    """Enumerations of read pair orientations."""

    FR = "FR"
    """A pair orientation for forward-reverse reads ("innie")."""

    RF = "RF"
    """A pair orientation for reverse-forward reads ("outie")."""

    TANDEM = "TANDEM"
    """A pair orientation for tandem (forward-forward or reverse-reverse) reads."""

    @classmethod
    def from_recs(  # noqa: C901  # `from_recs` is too complex (11 > 10)
        cls, rec1: AlignedSegment, rec2: Optional[AlignedSegment] = None
    ) -> Optional["PairOrientation"]:
        """Returns the pair orientation if both reads are mapped to the same reference sequence.

        Args:
            rec1: The first record in the pair.
            rec2: The second record in the pair. If None, then mate info on `rec1` will be used.

        See:
            [`htsjdk.samtools.SamPairUtil.getPairOrientation()`](https://github.com/samtools/htsjdk/blob/c31bc92c24bc4e9552b2a913e52286edf8f8ab96/src/main/java/htsjdk/samtools/SamPairUtil.java#L71-L102)
        """

        if rec2 is None:
            rec2_is_unmapped = rec1.mate_is_unmapped
            rec2_reference_id = rec1.next_reference_id
        else:
            rec2_is_unmapped = rec2.is_unmapped
            rec2_reference_id = rec2.reference_id

        if rec1.is_unmapped or rec2_is_unmapped or rec1.reference_id != rec2_reference_id:
            return None

        if rec2 is None:
            rec2_is_forward = rec1.mate_is_forward
            rec2_reference_start = rec1.next_reference_start
        else:
            rec2_is_forward = rec2.is_forward
            rec2_reference_start = rec2.reference_start

        if rec1.is_forward is rec2_is_forward:
            return PairOrientation.TANDEM
        if rec1.is_forward and rec1.reference_start <= rec2_reference_start:
            return PairOrientation.FR
        if rec1.is_reverse and rec2_reference_start < rec1.reference_end:
            return PairOrientation.FR
        if rec1.is_reverse and rec2_reference_start >= rec1.reference_end:
            return PairOrientation.RF

        if rec2 is None:
            if not rec1.has_tag("MC"):
                raise ValueError('Cannot determine pair orientation without a mate cigar ("MC")!')
            rec2_cigar = Cigar.from_cigarstring(str(rec1.get_tag("MC")))
            rec2_reference_end = rec1.next_reference_start + rec2_cigar.length_on_target()
        else:
            rec2_reference_end = rec2.reference_end

        if rec1.reference_start < rec2_reference_end:
            return PairOrientation.FR
        else:
            return PairOrientation.RF
Attributes
FR class-attribute instance-attribute
FR = 'FR'

A pair orientation for forward-reverse reads ("innie").

RF class-attribute instance-attribute
RF = 'RF'

A pair orientation for reverse-forward reads ("outie").

TANDEM class-attribute instance-attribute
TANDEM = 'TANDEM'

A pair orientation for tandem (forward-forward or reverse-reverse) reads.

Functions
from_recs classmethod
from_recs(rec1: AlignedSegment, rec2: Optional[AlignedSegment] = None) -> Optional[PairOrientation]

Returns the pair orientation if both reads are mapped to the same reference sequence.

Parameters:

Name Type Description Default
rec1 AlignedSegment

The first record in the pair.

required
rec2 Optional[AlignedSegment]

The second record in the pair. If None, then mate info on rec1 will be used.

None
See

htsjdk.samtools.SamPairUtil.getPairOrientation()

Source code in fgpyo/sam/__init__.py
@classmethod
def from_recs(  # noqa: C901  # `from_recs` is too complex (11 > 10)
    cls, rec1: AlignedSegment, rec2: Optional[AlignedSegment] = None
) -> Optional["PairOrientation"]:
    """Returns the pair orientation if both reads are mapped to the same reference sequence.

    Args:
        rec1: The first record in the pair.
        rec2: The second record in the pair. If None, then mate info on `rec1` will be used.

    See:
        [`htsjdk.samtools.SamPairUtil.getPairOrientation()`](https://github.com/samtools/htsjdk/blob/c31bc92c24bc4e9552b2a913e52286edf8f8ab96/src/main/java/htsjdk/samtools/SamPairUtil.java#L71-L102)
    """

    if rec2 is None:
        rec2_is_unmapped = rec1.mate_is_unmapped
        rec2_reference_id = rec1.next_reference_id
    else:
        rec2_is_unmapped = rec2.is_unmapped
        rec2_reference_id = rec2.reference_id

    if rec1.is_unmapped or rec2_is_unmapped or rec1.reference_id != rec2_reference_id:
        return None

    if rec2 is None:
        rec2_is_forward = rec1.mate_is_forward
        rec2_reference_start = rec1.next_reference_start
    else:
        rec2_is_forward = rec2.is_forward
        rec2_reference_start = rec2.reference_start

    if rec1.is_forward is rec2_is_forward:
        return PairOrientation.TANDEM
    if rec1.is_forward and rec1.reference_start <= rec2_reference_start:
        return PairOrientation.FR
    if rec1.is_reverse and rec2_reference_start < rec1.reference_end:
        return PairOrientation.FR
    if rec1.is_reverse and rec2_reference_start >= rec1.reference_end:
        return PairOrientation.RF

    if rec2 is None:
        if not rec1.has_tag("MC"):
            raise ValueError('Cannot determine pair orientation without a mate cigar ("MC")!')
        rec2_cigar = Cigar.from_cigarstring(str(rec1.get_tag("MC")))
        rec2_reference_end = rec1.next_reference_start + rec2_cigar.length_on_target()
    else:
        rec2_reference_end = rec2.reference_end

    if rec1.reference_start < rec2_reference_end:
        return PairOrientation.FR
    else:
        return PairOrientation.RF
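
A quick illustration, as a sketch assuming the fgpyo.sam.builder.SamBuilder test-record utility (not documented on this page; add_pair is assumed to default to an FR pair):

>>> from fgpyo.sam import PairOrientation
>>> from fgpyo.sam.builder import SamBuilder  # assumed helper for synthesizing records
>>> builder = SamBuilder()
>>> r1, r2 = builder.add_pair(chrom="chr1", start1=100, start2=250)
>>> PairOrientation.from_recs(r1, r2)
<PairOrientation.FR: 'FR'>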
ReadEditInfo

Counts various stats about how a read compares to a reference sequence.

Attributes:

Name Type Description
matches int

the number of bases in the read that match the reference

mismatches int

the number of mismatches between the read sequence and the reference sequence as dictated by the alignment. As is done for the SAM NM tag computation, any base except A/C/G/T in the read is considered a mismatch.

insertions int

the number of insertions in the read vs. the reference. I.e. the number of I operators in the CIGAR string.

inserted_bases int

the total number of bases contained within insertions in the read

deletions int

the number of deletions in the read vs. the reference. I.e. the number of D operators in the CIGAR string.

deleted_bases int

the total number of bases that are deleted within the alignment (i.e. bases in the reference but not in the read).

nm int

the computed value of the SAM NM tag, calculated as mismatches + inserted_bases + deleted_bases

Source code in fgpyo/sam/__init__.py
@attr.s(frozen=True, auto_attribs=True)
class ReadEditInfo:
    """
    Counts various stats about how a read compares to a reference sequence.

    Attributes:
        matches: the number of bases in the read that match the reference
        mismatches: the number of mismatches between the read sequence and the reference sequence
            as dictated by the alignment.  As is done for the SAM NM tag computation, any base
            except A/C/G/T in the read is considered a mismatch.
        insertions: the number of insertions in the read vs. the reference.  I.e. the number of I
            operators in the CIGAR string.
        inserted_bases: the total number of bases contained within insertions in the read
        deletions: the number of deletions in the read vs. the reference.  I.e. the number of D
            operators in the CIGAR string.
        deleted_bases: the total number of bases that are deleted within the alignment (i.e. bases
            in the reference but not in the read).
            the reference but not in the read).
        nm: the computed value of the SAM NM tag, calculated as mismatches + inserted_bases +
            deleted_bases
    """

    matches: int
    mismatches: int
    insertions: int
    inserted_bases: int
    deletions: int
    deleted_bases: int
    nm: int
SamFileType

Bases: Enum

Enumeration of valid SAM/BAM/CRAM file types.

Attributes:

Name Type Description
mode str

The additional mode character to add when opening this file type.

ext str

The standard file extension for this file type.

Source code in fgpyo/sam/__init__.py
@enum.unique
class SamFileType(enum.Enum):
    """Enumeration of valid SAM/BAM/CRAM file types.

    Attributes:
        mode (str): The additional mode character to add when opening this file type.
        ext (str): The standard file extension for this file type.
    """

    def __init__(self, mode: str, ext: str) -> None:
        self.mode = mode
        self.extension = ext

    SAM = ("", ".sam")
    BAM = ("b", ".bam")
    CRAM = ("c", ".cram")

    @property
    def indexable(self) -> bool:
        """True if the file type can be indexed, false otherwise."""
        return self is SamFileType.BAM or self is SamFileType.CRAM

    @classmethod
    def from_path(cls, path: Union[Path, str]) -> "SamFileType":
        """Infers the file type based on the file extension.

        Args:
            path: the path to the SAM/BAM/CRAM to read or write.
        """
        ext = Path(path).suffix
        try:
            return next(iter([tpe for tpe in SamFileType if tpe.extension == ext]))
        except StopIteration as ex:
            raise ValueError(f"Could not infer file type from {path}") from ex
Attributes
indexable property
indexable: bool

True if the file type can be indexed, false otherwise.

Functions
from_path classmethod
from_path(path: Union[Path, str]) -> SamFileType

Infers the file type based on the file extension.

Parameters:

Name Type Description Default
path Union[Path, str]

the path to the SAM/BAM/CRAM to read or write.

required
Source code in fgpyo/sam/__init__.py
@classmethod
def from_path(cls, path: Union[Path, str]) -> "SamFileType":
    """Infers the file type based on the file extension.

    Args:
        path: the path to the SAM/BAM/CRAM to read or write.
    """
    ext = Path(path).suffix
    try:
        return next(iter([tpe for tpe in SamFileType if tpe.extension == ext]))
    except StopIteration as ex:
        raise ValueError(f"Could not infer file type from {path}") from ex
SamOrder

Bases: Enum

Enumerations of possible sort orders for a SAM file.

Source code in fgpyo/sam/__init__.py
class SamOrder(enum.Enum):
    """
    Enumerations of possible sort orders for a SAM file.
    """

    Unsorted = "unsorted"  #: the SAM / BAM / CRAM is unsorted
    Coordinate = "coordinate"  #: coordinate sorted
    QueryName = "queryname"  #: queryname sorted
    Unknown = "unknown"  # Unknown SAM / BAM / CRAM sort order
SupplementaryAlignment

Stores a supplementary alignment record produced by BWA and stored in the SA SAM tag.

Attributes:

Name Type Description
reference_name str

the name of the reference (i.e. contig, chromosome) aligned to

start int

the 0-based start position of the alignment

is_forward bool

true if the alignment is on the forward strand, false otherwise

cigar Cigar

the cigar for the alignment

mapq int

the mapping quality

nm int

the number of edits

Source code in fgpyo/sam/__init__.py
@attr.s(frozen=True, auto_attribs=True)
class SupplementaryAlignment:
    """Stores a supplementary alignment record produced by BWA and stored in the SA SAM tag.

    Attributes:
        reference_name: the name of the reference (i.e. contig, chromosome) aligned to
        start: the 0-based start position of the alignment
        is_forward: true if the alignment is on the forward strand, false otherwise
        cigar: the cigar for the alignment
        mapq: the mapping quality
        nm: the number of edits
    """

    reference_name: str
    start: int
    is_forward: bool
    cigar: Cigar
    mapq: int
    nm: int

    def __str__(self) -> str:
        return ",".join(
            str(item)
            for item in (
                self.reference_name,
                self.start + 1,
                "+" if self.is_forward else "-",
                self.cigar,
                self.mapq,
                self.nm,
            )
        )

    @property
    def end(self) -> int:
        """The 0-based exclusive end position of the alignment."""
        return self.start + self.cigar.length_on_target()

    @staticmethod
    def parse(string: str) -> "SupplementaryAlignment":
        """Returns a supplementary alignment parsed from the given string.  The various fields
        should be comma-delimited (ex. `chr1,123,-,100M50S,60,4`)
        """
        fields = string.split(",")
        return SupplementaryAlignment(
            reference_name=fields[0],
            start=int(fields[1]) - 1,
            is_forward=fields[2] == "+",
            cigar=Cigar.from_cigarstring(fields[3]),
            mapq=int(fields[4]),
            nm=int(fields[5]),
        )

    @staticmethod
    def parse_sa_tag(tag: str) -> List["SupplementaryAlignment"]:
        """Parses an SA tag of supplementary alignments from a BAM file. If the tag is empty
        or contains just a single semi-colon then an empty list will be returned.  Otherwise
        a list containing a SupplementaryAlignment per ;-separated value in the tag will
        be returned.
        """
        return [SupplementaryAlignment.parse(a) for a in tag.split(";") if len(a) > 0]

    @classmethod
    def from_read(cls, read: pysam.AlignedSegment) -> List["SupplementaryAlignment"]:
        """
        Construct a list of SupplementaryAlignments from the SA tag in a pysam.AlignedSegment.

        Args:
            read: An alignment. The presence of the "SA" tag is not required.

        Returns:
            A list of all SupplementaryAlignments present in the SA tag.
            If the SA tag is not present, or it is empty, an empty list will be returned.
        """
        if read.has_tag("SA"):
            sa_tag: str = cast(str, read.get_tag("SA"))
            return cls.parse_sa_tag(sa_tag)
        else:
            return []
Attributes
end property
end: int

The 0-based exclusive end position of the alignment.

Functions
from_read classmethod
from_read(read: AlignedSegment) -> List[SupplementaryAlignment]

Construct a list of SupplementaryAlignments from the SA tag in a pysam.AlignedSegment.

Parameters:

Name Type Description Default
read AlignedSegment

An alignment. The presence of the "SA" tag is not required.

required

Returns:

Type Description
List[SupplementaryAlignment]

A list of all SupplementaryAlignments present in the SA tag.

List[SupplementaryAlignment]

If the SA tag is not present, or it is empty, an empty list will be returned.

Source code in fgpyo/sam/__init__.py
@classmethod
def from_read(cls, read: pysam.AlignedSegment) -> List["SupplementaryAlignment"]:
    """
    Construct a list of SupplementaryAlignments from the SA tag in a pysam.AlignedSegment.

    Args:
        read: An alignment. The presence of the "SA" tag is not required.

    Returns:
        A list of all SupplementaryAlignments present in the SA tag.
        If the SA tag is not present, or it is empty, an empty list will be returned.
    """
    if read.has_tag("SA"):
        sa_tag: str = cast(str, read.get_tag("SA"))
        return cls.parse_sa_tag(sa_tag)
    else:
        return []
parse staticmethod
parse(string: str) -> SupplementaryAlignment

Returns a supplementary alignment parsed from the given string. The various fields should be comma-delimited (ex. chr1,123,-,100M50S,60,4)

Source code in fgpyo/sam/__init__.py
@staticmethod
def parse(string: str) -> "SupplementaryAlignment":
    """Returns a supplementary alignment parsed from the given string.  The various fields
    should be comma-delimited (ex. `chr1,123,-,100M50S,60,4`)
    """
    fields = string.split(",")
    return SupplementaryAlignment(
        reference_name=fields[0],
        start=int(fields[1]) - 1,
        is_forward=fields[2] == "+",
        cigar=Cigar.from_cigarstring(fields[3]),
        mapq=int(fields[4]),
        nm=int(fields[5]),
    )
parse_sa_tag staticmethod
parse_sa_tag(tag: str) -> List[SupplementaryAlignment]

Parses an SA tag of supplementary alignments from a BAM file. If the tag is empty or contains just a single semi-colon then an empty list will be returned. Otherwise a list containing a SupplementaryAlignment per ;-separated value in the tag will be returned.

Source code in fgpyo/sam/__init__.py
@staticmethod
def parse_sa_tag(tag: str) -> List["SupplementaryAlignment"]:
    """Parses an SA tag of supplementary alignments from a BAM file. If the tag is empty
    or contains just a single semi-colon then an empty list will be returned.  Otherwise
    a list containing a SupplementaryAlignment per ;-separated value in the tag will
    be returned.
    """
    return [SupplementaryAlignment.parse(a) for a in tag.split(";") if len(a) > 0]
Template

A container for alignment records corresponding to a single sequenced template or insert.

It is strongly preferred that new Template instances be created with Template.build(), which will ensure that reads are stored in the correct Template property and, by default, run basic validations of the Template. Users constructing Template instances directly are encouraged to call the validate method post-construction.

In the special case where alignment records are both secondary and supplementary, they will be stored in the r1_supplementals and r2_supplementals fields only.

Attributes:

Name Type Description
name str

the name of the template/query

r1 Optional[AlignedSegment]

Primary non-supplementary alignment for read 1, or None if there is none

r2 Optional[AlignedSegment]

Primary non-supplementary alignment for read 2, or None if there is none

r1_supplementals List[AlignedSegment]

Supplementary alignments for read 1

r2_supplementals List[AlignedSegment]

Supplementary alignments for read 2

r1_secondaries List[AlignedSegment]

Secondary (non-primary, non-supplementary) alignments for read 1

r2_secondaries List[AlignedSegment]

Secondary (non-primary, non-supplementary) alignments for read 2

Source code in fgpyo/sam/__init__.py
@attr.s(frozen=True, auto_attribs=True)
class Template:
    """A container for alignment records corresponding to a single sequenced template
    or insert.

    It is strongly preferred that new Template instances be created with `Template.build()`,
    which will ensure that reads are stored in the correct Template property and, by default,
    run basic validations of the Template.  Users constructing Template instances directly are
    encouraged to call the validate method post-construction.

    In the special case where alignment records are _*both secondary and supplementary*_, they
    will be stored in the `r1_supplementals` and `r2_supplementals` fields only.

    Attributes:
        name: the name of the template/query
        r1: Primary non-supplementary alignment for read 1, or None if there is none
        r2: Primary non-supplementary alignment for read 2, or None if there is none
        r1_supplementals: Supplementary alignments for read 1
        r2_supplementals: Supplementary alignments for read 2
        r1_secondaries: Secondary (non-primary, non-supplementary) alignments for read 1
        r2_secondaries: Secondary (non-primary, non-supplementary) alignments for read 2
    """

    name: str
    r1: Optional[AlignedSegment]
    r2: Optional[AlignedSegment]
    r1_supplementals: List[AlignedSegment]
    r2_supplementals: List[AlignedSegment]
    r1_secondaries: List[AlignedSegment]
    r2_secondaries: List[AlignedSegment]

    @staticmethod
    def iterator(alns: Iterator[AlignedSegment]) -> Iterator["Template"]:
        """Returns an iterator over templates. Assumes the input iterable is queryname grouped,
        and gathers consecutive runs of records sharing a common query name into templates."""
        return TemplateIterator(alns)

    @staticmethod
    def build(recs: Iterable[AlignedSegment], validate: bool = True) -> "Template":
        """Build a template from a set of records all with the same queryname."""
        name = None
        r1 = None
        r2 = None
        r1_supplementals: List[AlignedSegment] = []
        r2_supplementals: List[AlignedSegment] = []
        r1_secondaries: List[AlignedSegment] = []
        r2_secondaries: List[AlignedSegment] = []

        for rec in recs:
            if name is None:
                name = rec.query_name

            is_r1 = not rec.is_paired or rec.is_read1

            if not rec.is_supplementary and not rec.is_secondary:
                if is_r1:
                    assert r1 is None, f"Multiple R1 primary reads found in {recs}"
                    r1 = rec
                else:
                    assert r2 is None, f"Multiple R2 primary reads found in {recs}"
                    r2 = rec
            elif rec.is_supplementary:
                if is_r1:
                    r1_supplementals.append(rec)
                else:
                    r2_supplementals.append(rec)
            elif rec.is_secondary:
                if is_r1:
                    r1_secondaries.append(rec)
                else:
                    r2_secondaries.append(rec)

        assert name is not None, "Cannot construct a template from zero records."

        template = Template(
            name=name,
            r1=r1,
            r2=r2,
            r1_supplementals=r1_supplementals,
            r2_supplementals=r2_supplementals,
            r1_secondaries=r1_secondaries,
            r2_secondaries=r2_secondaries,
        )

        if validate:
            template.validate()

        return template

    def validate(self) -> None:
        """Performs sanity checks that all the records in the Template are as expected."""
        for rec in self.all_recs():
            assert rec.query_name == self.name, f"Name error {self.name} vs. {rec.query_name}"

        if self.r1 is not None:
            assert self.r1.is_read1 or not self.r1.is_paired, "R1 not flagged as R1 or unpaired"
            assert not self.r1.is_supplementary, "R1 primary flagged as supplementary"
            assert not self.r1.is_secondary, "R1 primary flagged as secondary"

        if self.r2 is not None:
            assert self.r2.is_read2, "R2 not flagged as R2"
            assert not self.r2.is_supplementary, "R2 primary flagged as supplementary"
            assert not self.r2.is_secondary, "R2 primary flagged as secondary"

        for rec in self.r1_secondaries:
            assert rec.is_read1 or not rec.is_paired, "R1 secondary not flagged as R1 or unpaired"
            assert rec.is_secondary, "R1 secondary not flagged as secondary"
            assert not rec.is_supplementary, "R1 secondary supplementals belong with supplementals"

        for rec in self.r1_supplementals:
            assert rec.is_read1 or not rec.is_paired, "R1 supp. not flagged as R1 or unpaired"
            assert rec.is_supplementary, "R1 supp. not flagged as supplementary"

        for rec in self.r2_secondaries:
            assert rec.is_read2, "R2 secondary not flagged as R2"
            assert rec.is_secondary, "R2 secondary not flagged as secondary"
            assert not rec.is_supplementary, "R2 secondary supplementals belong with supplementals"

        for rec in self.r2_supplementals:
            assert rec.is_read2, "R2 supp. not flagged as R2"
            assert rec.is_supplementary, "R2 supp. not flagged as supplementary"

    def primary_recs(self) -> Iterator[AlignedSegment]:
        """Returns a list with all the primary records for the template."""
        return (r for r in (self.r1, self.r2) if r is not None)

    def all_r1s(self) -> Iterator[AlignedSegment]:
        """Yields all R1 alignments of this template including secondary and supplementary."""
        r1_primary = [] if self.r1 is None else [self.r1]
        return chain(r1_primary, self.r1_secondaries, self.r1_supplementals)

    def all_r2s(self) -> Iterator[AlignedSegment]:
        """Yields all R2 alignments of this template including secondary and supplementary."""
        r2_primary = [] if self.r2 is None else [self.r2]
        return chain(r2_primary, self.r2_secondaries, self.r2_supplementals)

    def all_recs(self) -> Iterator[AlignedSegment]:
        """Returns a list with all the records for the template."""
        for rec in self.primary_recs():
            yield rec

        for recs in (
            self.r1_supplementals,
            self.r1_secondaries,
            self.r2_supplementals,
            self.r2_secondaries,
        ):
            for rec in recs:
                yield rec

    def set_mate_info(
        self,
        is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair,
        isize: Callable[[AlignedSegment, AlignedSegment], int] = isize,
    ) -> Self:
        """Reset all mate information on every alignment in the template.

        Args:
            is_proper_pair: A function that takes two alignments and determines proper pair status.
            isize: A function that takes the two alignments and calculates their isize.
        """
        if self.r1 is not None and self.r2 is not None:
            set_mate_info(self.r1, self.r2, is_proper_pair=is_proper_pair, isize=isize)
        if self.r1 is not None:
            for rec in self.r2_secondaries:
                set_mate_info_on_secondary(secondary=rec, mate_primary=self.r1)
            for rec in self.r2_supplementals:
                set_mate_info_on_supplementary(supp=rec, mate_primary=self.r1)
        if self.r2 is not None:
            for rec in self.r1_secondaries:
                set_mate_info_on_secondary(secondary=rec, mate_primary=self.r2)
            for rec in self.r1_supplementals:
                set_mate_info_on_supplementary(supp=rec, mate_primary=self.r2)
        return self

    def write_to(
        self,
        writer: SamFile,
        primary_only: bool = False,
    ) -> None:
        """Write the records associated with the template to file.

        Args:
            writer: An open, writable AlignmentFile.
            primary_only: If True, only write primary alignments.
        """

        if primary_only:
            rec_iter = self.primary_recs()
        else:
            rec_iter = self.all_recs()

        for rec in rec_iter:
            writer.write(rec)

    def set_tag(
        self,
        tag: str,
        value: Union[str, int, float, None],
    ) -> None:
        """Add a tag to all records associated with the template.

        Setting a tag to `None` will remove the tag.

        Args:
            tag: The name of the tag.
            value: The value of the tag.
        """

        assert len(tag) == 2, f"Tags must be 2 characters: {tag}."

        for rec in self.all_recs():
            rec.set_tag(tag, value)
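
A minimal usage sketch, again assuming the fgpyo.sam.builder.SamBuilder utility (the name= parameter is assumed to set the query name):

>>> from fgpyo.sam import Template
>>> from fgpyo.sam.builder import SamBuilder  # assumed helper for synthesizing records
>>> builder = SamBuilder()
>>> r1, r2 = builder.add_pair(name="q1", chrom="chr1", start1=100, start2=200)
>>> template = Template.build([r1, r2])
>>> template.name
'q1'
>>> len(list(template.all_recs()))
2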
Functions
all_r1s
all_r1s() -> Iterator[AlignedSegment]

Yields all R1 alignments of this template including secondary and supplementary.

Source code in fgpyo/sam/__init__.py
def all_r1s(self) -> Iterator[AlignedSegment]:
    """Yields all R1 alignments of this template including secondary and supplementary."""
    r1_primary = [] if self.r1 is None else [self.r1]
    return chain(r1_primary, self.r1_secondaries, self.r1_supplementals)
all_r2s
all_r2s() -> Iterator[AlignedSegment]

Yields all R2 alignments of this template including secondary and supplementary.

Source code in fgpyo/sam/__init__.py
def all_r2s(self) -> Iterator[AlignedSegment]:
    """Yields all R2 alignments of this template including secondary and supplementary."""
    r2_primary = [] if self.r2 is None else [self.r2]
    return chain(r2_primary, self.r2_secondaries, self.r2_supplementals)
all_recs
all_recs() -> Iterator[AlignedSegment]

Returns an iterator over all the records for the template.

Source code in fgpyo/sam/__init__.py
def all_recs(self) -> Iterator[AlignedSegment]:
    """Returns a list with all the records for the template."""
    for rec in self.primary_recs():
        yield rec

    for recs in (
        self.r1_supplementals,
        self.r1_secondaries,
        self.r2_supplementals,
        self.r2_secondaries,
    ):
        for rec in recs:
            yield rec
build staticmethod
build(recs: Iterable[AlignedSegment], validate: bool = True) -> Template

Build a template from a set of records all with the same queryname.

Source code in fgpyo/sam/__init__.py
@staticmethod
def build(recs: Iterable[AlignedSegment], validate: bool = True) -> "Template":
    """Build a template from a set of records all with the same queryname."""
    name = None
    r1 = None
    r2 = None
    r1_supplementals: List[AlignedSegment] = []
    r2_supplementals: List[AlignedSegment] = []
    r1_secondaries: List[AlignedSegment] = []
    r2_secondaries: List[AlignedSegment] = []

    for rec in recs:
        if name is None:
            name = rec.query_name

        is_r1 = not rec.is_paired or rec.is_read1

        if not rec.is_supplementary and not rec.is_secondary:
            if is_r1:
                assert r1 is None, f"Multiple R1 primary reads found in {recs}"
                r1 = rec
            else:
                assert r2 is None, f"Multiple R2 primary reads found in {recs}"
                r2 = rec
        elif rec.is_supplementary:
            if is_r1:
                r1_supplementals.append(rec)
            else:
                r2_supplementals.append(rec)
        elif rec.is_secondary:
            if is_r1:
                r1_secondaries.append(rec)
            else:
                r2_secondaries.append(rec)

    assert name is not None, "Cannot construct a template from zero records."

    template = Template(
        name=name,
        r1=r1,
        r2=r2,
        r1_supplementals=r1_supplementals,
        r2_supplementals=r2_supplementals,
        r1_secondaries=r1_secondaries,
        r2_secondaries=r2_secondaries,
    )

    if validate:
        template.validate()

    return template
iterator staticmethod
iterator(alns: Iterator[AlignedSegment]) -> Iterator[Template]

Returns an iterator over templates. Assumes the input iterable is queryname grouped, and gathers consecutive runs of records sharing a common query name into templates.

Source code in fgpyo/sam/__init__.py
@staticmethod
def iterator(alns: Iterator[AlignedSegment]) -> Iterator["Template"]:
    """Returns an iterator over templates. Assumes the input iterable is queryname grouped,
    and gathers consecutive runs of records sharing a common query name into templates."""
    return TemplateIterator(alns)
primary_recs
primary_recs() -> Iterator[AlignedSegment]

Returns an iterator over the primary records for the template.

Source code in fgpyo/sam/__init__.py
def primary_recs(self) -> Iterator[AlignedSegment]:
    """Returns a list with all the primary records for the template."""
    return (r for r in (self.r1, self.r2) if r is not None)
set_mate_info
set_mate_info(is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair, isize: Callable[[AlignedSegment, AlignedSegment], int] = isize) -> Self

Reset all mate information on every alignment in the template.

Parameters:

Name Type Description Default
is_proper_pair Callable[[AlignedSegment, AlignedSegment], bool]

A function that takes two alignments and determines proper pair status.

is_proper_pair
isize Callable[[AlignedSegment, AlignedSegment], int]

A function that takes the two alignments and calculates their isize.

isize
Source code in fgpyo/sam/__init__.py
def set_mate_info(
    self,
    is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair,
    isize: Callable[[AlignedSegment, AlignedSegment], int] = isize,
) -> Self:
    """Reset all mate information on every alignment in the template.

    Args:
        is_proper_pair: A function that takes two alignments and determines proper pair status.
        isize: A function that takes the two alignments and calculates their isize.
    """
    if self.r1 is not None and self.r2 is not None:
        set_mate_info(self.r1, self.r2, is_proper_pair=is_proper_pair, isize=isize)
    if self.r1 is not None:
        for rec in self.r2_secondaries:
            set_mate_info_on_secondary(secondary=rec, mate_primary=self.r1)
        for rec in self.r2_supplementals:
            set_mate_info_on_supplementary(supp=rec, mate_primary=self.r1)
    if self.r2 is not None:
        for rec in self.r1_secondaries:
            set_mate_info_on_secondary(secondary=rec, mate_primary=self.r2)
        for rec in self.r1_supplementals:
            set_mate_info_on_supplementary(supp=rec, mate_primary=self.r2)
    return self
set_tag
set_tag(tag: str, value: Union[str, int, float, None]) -> None

Add a tag to all records associated with the template.

Setting a tag to None will remove the tag.

Parameters:

Name Type Description Default
tag str

The name of the tag.

required
value Union[str, int, float, None]

The value of the tag.

required
Source code in fgpyo/sam/__init__.py
def set_tag(
    self,
    tag: str,
    value: Union[str, int, float, None],
) -> None:
    """Add a tag to all records associated with the template.

    Setting a tag to `None` will remove the tag.

    Args:
        tag: The name of the tag.
        value: The value of the tag.
    """

    assert len(tag) == 2, f"Tags must be 2 characters: {tag}."

    for rec in self.all_recs():
        rec.set_tag(tag, value)
validate
validate() -> None

Performs sanity checks that all the records in the Template are as expected.

Source code in fgpyo/sam/__init__.py
def validate(self) -> None:
    """Performs sanity checks that all the records in the Template are as expected."""
    for rec in self.all_recs():
        assert rec.query_name == self.name, f"Name error {self.name} vs. {rec.query_name}"

    if self.r1 is not None:
        assert self.r1.is_read1 or not self.r1.is_paired, "R1 not flagged as R1 or unpaired"
        assert not self.r1.is_supplementary, "R1 primary flagged as supplementary"
        assert not self.r1.is_secondary, "R1 primary flagged as secondary"

    if self.r2 is not None:
        assert self.r2.is_read2, "R2 not flagged as R2"
        assert not self.r2.is_supplementary, "R2 primary flagged as supplementary"
        assert not self.r2.is_secondary, "R2 primary flagged as secondary"

    for rec in self.r1_secondaries:
        assert rec.is_read1 or not rec.is_paired, "R1 secondary not flagged as R1 or unpaired"
        assert rec.is_secondary, "R1 secondary not flagged as secondary"
        assert not rec.is_supplementary, "R1 secondary supplementals belong with supplementals"

    for rec in self.r1_supplementals:
        assert rec.is_read1 or not rec.is_paired, "R1 supp. not flagged as R1 or unpaired"
        assert rec.is_supplementary, "R1 supp. not flagged as supplementary"

    for rec in self.r2_secondaries:
        assert rec.is_read2, "R2 secondary not flagged as R2"
        assert rec.is_secondary, "R2 secondary not flagged as secondary"
        assert not rec.is_supplementary, "R2 secondary supplementals belong with supplementals"

    for rec in self.r2_supplementals:
        assert rec.is_read2, "R2 supp. not flagged as R2"
        assert rec.is_supplementary, "R2 supp. not flagged as supplementary"
write_to
write_to(writer: AlignmentFile, primary_only: bool = False) -> None

Write the records associated with the template to file.

Parameters:

Name Type Description Default
writer AlignmentFile

An open, writable AlignmentFile.

required
primary_only bool

If True, only write primary alignments.

False
Source code in fgpyo/sam/__init__.py
def write_to(
    self,
    writer: SamFile,
    primary_only: bool = False,
) -> None:
    """Write the records associated with the template to file.

    Args:
        writer: An open, writable AlignmentFile.
        primary_only: If True, only write primary alignments.
    """

    if primary_only:
        rec_iter = self.primary_recs()
    else:
        rec_iter = self.all_recs()

    for rec in rec_iter:
        writer.write(rec)
TemplateIterator

Bases: Iterator[Template]

An iterator that converts an iterator over query-grouped reads into an iterator over templates.

Source code in fgpyo/sam/__init__.py
class TemplateIterator(Iterator[Template]):
    """
    An iterator that converts an iterator over query-grouped reads into an iterator
    over templates.
    """

    def __init__(self, iterator: Iterator[AlignedSegment]) -> None:
        self._iter = PeekableIterator(iterator)

    def __iter__(self) -> Iterator[Template]:
        return self

    def __next__(self) -> Template:
        name = self._iter.peek().query_name
        recs = self._iter.takewhile(lambda r: r.query_name == name)
        return Template.build(recs, validate=False)

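A sketch of iterating templates from a queryname-grouped input (the path is a placeholder):

>>> from fgpyo.sam import Template, reader
>>> with reader("/path/to/queryname_grouped.bam") as fh:
...     for template in Template.iterator(fh):
...         print(template.name, len(list(template.all_recs())))
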
Functions

calculate_edit_info
calculate_edit_info(rec: AlignedSegment, reference_sequence: str, reference_offset: Optional[int] = None) -> ReadEditInfo

Constructs a ReadEditInfo instance giving summary stats about how the read aligns to the reference. Computes the number of mismatches, indels, indel bases and the SAM NM tag. The read must be aligned.

Parameters:

Name Type Description Default
rec AlignedSegment

the read/record for which to calculate values

required
reference_sequence str

the reference sequence (or fragment thereof) that the read is aligned to

required
reference_offset Optional[int]

if provided, assume that reference_sequence[reference_offset] is the first base aligned to in reference_sequence, otherwise use rec.reference_start

None

Returns:

Type Description
ReadEditInfo

a ReadEditInfo with information about how the read differs from the reference

Source code in fgpyo/sam/__init__.py
def calculate_edit_info(
    rec: AlignedSegment, reference_sequence: str, reference_offset: Optional[int] = None
) -> ReadEditInfo:
    """
    Constructs a `ReadEditInfo` instance giving summary stats about how the read aligns to the
    reference.  Computes the number of mismatches, indels, indel bases and the SAM NM tag.
    The read must be aligned.

    Args:
        rec: the read/record for which to calculate values
        reference_sequence: the reference sequence (or fragment thereof) that the read is
            aligned to
        reference_offset: if provided, assume that reference_sequence[reference_offset] is the
            first base aligned to in reference_sequence, otherwise use rec.reference_start

    Returns:
        a ReadEditInfo with information about how the read differs from the reference
    """
    assert not rec.is_unmapped, f"Cannot calculate edit info for unmapped read: {rec}"

    query_offset = 0
    target_offset = reference_offset if reference_offset is not None else rec.reference_start
    cigar = Cigar.from_cigartuples(rec.cigartuples)

    matches, mms, insertions, ins_bases, deletions, del_bases = 0, 0, 0, 0, 0, 0
    ok_bases = {"A", "C", "G", "T"}

    for elem in cigar.elements:
        op = elem.operator

        if op == CigarOp.I:
            insertions += 1
            ins_bases += elem.length
        elif op == CigarOp.D:
            deletions += 1
            del_bases += elem.length
        elif op == CigarOp.M or op == CigarOp.X or op == CigarOp.EQ:
            for i in range(0, elem.length):
                q = rec.query_sequence[query_offset + i].upper()
                t = reference_sequence[target_offset + i].upper()
                if q != t or q not in ok_bases:
                    mms += 1
                else:
                    matches += 1

        query_offset += elem.length_on_query
        target_offset += elem.length_on_target

    return ReadEditInfo(
        matches=matches,
        mismatches=mms,
        insertions=insertions,
        inserted_bases=ins_bases,
        deletions=deletions,
        deleted_bases=del_bases,
        nm=mms + ins_bases + del_bases,
    )
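
An illustrative sketch, assuming the fgpyo.sam.builder.SamBuilder utility (the add_single signature with bases=/chrom=/start=/cigar= is an assumption). The read below matches an 8 bp reference except at its final base:

>>> from fgpyo.sam import calculate_edit_info
>>> from fgpyo.sam.builder import SamBuilder  # assumed helper for synthesizing records
>>> builder = SamBuilder()
>>> rec = builder.add_single(bases="ACGTACGT", chrom="chr1", start=0, cigar="8M")
>>> info = calculate_edit_info(rec, reference_sequence="ACGTACGA")
>>> (info.matches, info.mismatches, info.nm)
(7, 1, 1)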
is_proper_pair
is_proper_pair(rec1: AlignedSegment, rec2: Optional[AlignedSegment] = None, max_insert_size: int = 1000, orientations: Collection[PairOrientation] = DefaultProperlyPairedOrientations, isize: Callable[[AlignedSegment, AlignedSegment], int] = isize) -> bool

Determines if a pair of records are properly paired or not.

Criteria for records in a proper pair are
  • Both records are aligned
  • Both records are aligned to the same reference sequence
  • The pair orientation of the records is one of the valid pair orientations (default "FR")
  • The inferred insert size is not more than a maximum length (default 1000)

Parameters:

Name Type Description Default
rec1 AlignedSegment

The first record in the pair.

required
rec2 Optional[AlignedSegment]

The second record in the pair. If None, then mate info on rec1 will be used.

None
max_insert_size int

The maximum insert size to consider a pair "proper".

1000
orientations Collection[PairOrientation]

The valid set of orientations to consider a pair "proper".

DefaultProperlyPairedOrientations
isize Callable[[AlignedSegment, AlignedSegment], int]

A function that takes the two alignments and calculates their isize.

isize
See

htsjdk.samtools.SamPairUtil.isProperPair()

Source code in fgpyo/sam/__init__.py
def is_proper_pair(
    rec1: AlignedSegment,
    rec2: Optional[AlignedSegment] = None,
    max_insert_size: int = 1000,
    orientations: Collection[PairOrientation] = DefaultProperlyPairedOrientations,
    isize: Callable[[AlignedSegment, AlignedSegment], int] = isize,
) -> bool:
    """Determines if a pair of records are properly paired or not.

    Criteria for records in a proper pair are:
        - Both records are aligned
        - Both records are aligned to the same reference sequence
        - The pair orientation of the records is one of the valid pair orientations (default "FR")
        - The inferred insert size is not more than a maximum length (default 1000)

    Args:
        rec1: The first record in the pair.
        rec2: The second record in the pair. If None, then mate info on `rec1` will be used.
        max_insert_size: The maximum insert size to consider a pair "proper".
        orientations: The valid set of orientations to consider a pair "proper".
        isize: A function that takes the two alignments and calculates their isize.

    See:
        [`htsjdk.samtools.SamPairUtil.isProperPair()`](https://github.com/samtools/htsjdk/blob/c31bc92c24bc4e9552b2a913e52286edf8f8ab96/src/main/java/htsjdk/samtools/SamPairUtil.java#L106-L125)
    """
    if rec2 is None:
        rec2_is_mapped = rec1.mate_is_mapped
        rec2_reference_id = rec1.next_reference_id
    else:
        rec2_is_mapped = rec2.is_mapped
        rec2_reference_id = rec2.reference_id

    return (
        rec1.is_mapped
        and rec2_is_mapped
        and rec1.reference_id == rec2_reference_id
        and PairOrientation.from_recs(rec1=rec1, rec2=rec2) in orientations
        and 0 < abs(isize(rec1, rec2)) <= max_insert_size
    )
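
For example, a sketch using SamBuilder to produce FR pairs; the coordinates are illustrative:

>>> from fgpyo.sam import is_proper_pair
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()
>>> r1, r2 = builder.add_pair(chrom="chr1", start1=100, start2=300)
>>> is_proper_pair(r1, r2)
True
>>> far1, far2 = builder.add_pair(chrom="chr1", start1=100, start2=100000)
>>> is_proper_pair(far1, far2)  # inferred insert size exceeds the default maximum of 1000
False
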
isize
isize(rec1: AlignedSegment, rec2: Optional[AlignedSegment] = None) -> int

Computes the insert size ("template length" or "TLEN") for a pair of records.

Parameters:

Name Type Description Default
rec1 AlignedSegment

The first record in the pair.

required
rec2 Optional[AlignedSegment]

The second record in the pair. If None, then mate info on rec1 will be used.

None
Source code in fgpyo/sam/__init__.py
def isize(rec1: AlignedSegment, rec2: Optional[AlignedSegment] = None) -> int:
    """Computes the insert size ("template length" or "TLEN") for a pair of records.

    Args:
        rec1: The first record in the pair.
        rec2: The second record in the pair. If None, then mate info on `rec1` will be used.
    """
    if rec2 is None:
        rec2_is_unmapped = rec1.mate_is_unmapped
        rec2_reference_id = rec1.next_reference_id
    else:
        rec2_is_unmapped = rec2.is_unmapped
        rec2_reference_id = rec2.reference_id

    if rec1.is_unmapped or rec2_is_unmapped or rec1.reference_id != rec2_reference_id:
        return 0

    if rec2 is None:
        rec2_is_forward = rec1.mate_is_forward
        rec2_reference_start = rec1.next_reference_start
    else:
        rec2_is_forward = rec2.is_forward
        rec2_reference_start = rec2.reference_start

    if rec1.is_forward and rec2_is_forward:
        return rec2_reference_start - rec1.reference_start
    if rec1.is_reverse and rec2_is_forward:
        return rec2_reference_start - rec1.reference_end

    if rec2 is None:
        if not rec1.has_tag("MC"):
            raise ValueError('Cannot determine proper pair status without a mate cigar ("MC")!')
        rec2_cigar = Cigar.from_cigarstring(str(rec1.get_tag("MC")))
        rec2_reference_end = rec1.next_reference_start + rec2_cigar.length_on_target()
    else:
        rec2_reference_end = rec2.reference_end

    if rec1.is_forward:
        return rec2_reference_end - rec1.reference_start
    else:
        return rec2_reference_end - rec1.reference_end
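
For example, with the builder's default 100bp reads, an FR pair with R1 starting at 100 and R2 starting at 300 spans reference positions 100 through 400:

>>> from fgpyo.sam import isize
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()  # default read length is 100bp
>>> r1, r2 = builder.add_pair(chrom="chr1", start1=100, start2=300)
>>> isize(r1, r2)  # R2 ends at 400, so the insert spans 100-400
300
>>> isize(r2, r1)
-300
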
reader
reader(path: SamPath, file_type: Optional[SamFileType] = None, unmapped: bool = False) -> AlignmentFile

Opens a SAM/BAM/CRAM for reading.

To read from standard input, provide any of "-", "stdin", or "/dev/stdin" as the input path.

Parameters:

Name Type Description Default
path SamPath

a file handle or path to the SAM/BAM/CRAM to read or write.

required
file_type Optional[SamFileType]

the file type to assume when opening the file. If None, then the file type will be auto-detected.

None
unmapped bool

True if the file is unmapped and has no sequence dictionary, False otherwise.

False
Source code in fgpyo/sam/__init__.py
def reader(
    path: SamPath, file_type: Optional[SamFileType] = None, unmapped: bool = False
) -> SamFile:
    """Opens a SAM/BAM/CRAM for reading.

    To read from standard input, provide any of `"-"`, `"stdin"`, or `"/dev/stdin"` as the input
    `path`.

    Args:
        path: a file handle or path to the SAM/BAM/CRAM to read or write.
        file_type: the file type to assume when opening the file.  If None, then the file
            type will be auto-detected.
        unmapped: True if the file is unmapped and has no sequence dictionary, False otherwise.
    """
    return _pysam_open(path=path, open_for_reading=True, file_type=file_type, unmapped=unmapped)
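
For example, a minimal sketch of iterating over a BAM; the input path here is hypothetical:

from fgpyo import sam

# Hypothetical input path; the file type is auto-detected from the extension.
with sam.reader("/data/sample.bam") as fh:
    for rec in fh:
        print(rec.query_name)
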
set_mate_info
set_mate_info(rec1: AlignedSegment, rec2: AlignedSegment, is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair, isize: Callable[[AlignedSegment, AlignedSegment], int] = isize) -> None

Resets mate pair information between two primary alignments that share a query name.

Parameters:

Name Type Description Default
rec1 AlignedSegment

The first record in the pair.

required
rec2 AlignedSegment

The second record in the pair.

required
is_proper_pair Callable[[AlignedSegment, AlignedSegment], bool]

A function that takes the two alignments and determines proper pair status.

is_proper_pair
isize Callable[[AlignedSegment, AlignedSegment], int]

A function that takes the two alignments and calculates their isize.

isize

Raises:

Type Description
ValueError

If rec1 and rec2 are of the same read ordinal.

ValueError

If either rec1 or rec2 is secondary or supplementary.

ValueError

If rec1 and rec2 do not share the same query name.

Source code in fgpyo/sam/__init__.py
def set_mate_info(
    rec1: AlignedSegment,
    rec2: AlignedSegment,
    is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair,
    isize: Callable[[AlignedSegment, AlignedSegment], int] = isize,
) -> None:
    """Resets mate pair information between two primary alignments that share a query name.

    Args:
        rec1: The first record in the pair.
        rec2: The second record in the pair.
        is_proper_pair: A function that takes the two alignments and determines proper pair status.
        isize: A function that takes the two alignments and calculates their isize.

    Raises:
        ValueError: If rec1 and rec2 are of the same read ordinal.
        ValueError: If either rec1 or rec2 is secondary or supplementary.
        ValueError: If rec1 and rec2 do not share the same query name.
    """
    for dest, source in [(rec1, rec2), (rec2, rec1)]:
        _set_common_mate_fields(dest=dest, mate_primary=source)

    template_length = isize(rec1, rec2)
    rec1.template_length = template_length
    rec2.template_length = -template_length

    proper_pair = is_proper_pair(rec1, rec2)
    rec1.is_proper_pair = proper_pair
    rec2.is_proper_pair = proper_pair
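
For example, a sketch in which one record is moved and the mate fields are re-synced (coordinates illustrative; the mate start is among the fields reset):

>>> from fgpyo.sam import set_mate_info
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()
>>> r1, r2 = builder.add_pair(chrom="chr1", start1=100, start2=300)
>>> r2.reference_start = 400  # move R2, leaving stale mate info on R1
>>> set_mate_info(r1, r2)
>>> r1.next_reference_start
400
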
set_mate_info_on_secondary
set_mate_info_on_secondary(secondary: AlignedSegment, mate_primary: AlignedSegment) -> None

Set mate info on a secondary alignment from its mate's primary alignment.

Parameters:

Name Type Description Default
secondary AlignedSegment

The secondary alignment to set mate information upon.

required
mate_primary AlignedSegment

The primary alignment of the secondary's mate.

required

Raises:

Type Description
ValueError

If secondary and mate_primary are of the same read ordinal.

ValueError

If secondary and mate_primary do not share the same query name.

ValueError

If mate_primary is secondary or supplementary.

ValueError

If secondary is not marked as a secondary alignment.

Source code in fgpyo/sam/__init__.py
def set_mate_info_on_secondary(secondary: AlignedSegment, mate_primary: AlignedSegment) -> None:
    """Set mate info on a secondary alignment from its mate's primary alignment.

    Args:
        secondary: The secondary alignment to set mate information upon.
        mate_primary: The primary alignment of the secondary's mate.

    Raises:
        ValueError: If secondary and mate_primary are of the same read ordinal.
        ValueError: If secondary and mate_primary do not share the same query name.
        ValueError: If mate_primary is secondary or supplementary.
        ValueError: If secondary is not marked as a secondary alignment.
    """
    if not secondary.is_secondary:
        raise ValueError("Cannot set mate info on an alignment not marked as secondary!")

    _set_common_mate_fields(dest=secondary, mate_primary=mate_primary)
set_mate_info_on_supplementary
set_mate_info_on_supplementary(supp: AlignedSegment, mate_primary: AlignedSegment) -> None

Set mate info on a supplementary alignment from its mate's primary alignment.

Parameters:

Name Type Description Default
supp AlignedSegment

The supplementary alignment to set mate information upon.

required
mate_primary AlignedSegment

The primary alignment of the supplementary's mate.

required

Raises:

Type Description
ValueError

If supp and mate_primary are of the same read ordinal.

ValueError

If supp and mate_primary do not share the same query name.

ValueError

If mate_primary is secondary or supplementary.

ValueError

If supp is not marked as a supplementary alignment.

Source code in fgpyo/sam/__init__.py
def set_mate_info_on_supplementary(supp: AlignedSegment, mate_primary: AlignedSegment) -> None:
    """Set mate info on a supplementary alignment from its mate's primary alignment.

    Args:
        supp: The supplementary alignment to set mate information upon.
        mate_primary: The primary alignment of the supplementary's mate.

    Raises:
        ValueError: If supp and mate_primary are of the same read ordinal.
        ValueError: If supp and mate_primary do not share the same query name.
        ValueError: If mate_primary is secondary or supplementary.
        ValueError: If supp is not marked as a supplementary alignment.
    """
    if not supp.is_supplementary:
        raise ValueError("Cannot set mate info on an alignment not marked as supplementary!")

    _set_common_mate_fields(dest=supp, mate_primary=mate_primary)

    # NB: for a non-secondary supplemental alignment, set the following to the same as the primary.
    if not supp.is_secondary:
        supp.is_proper_pair = mate_primary.is_proper_pair
        supp.template_length = -mate_primary.template_length
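
For example, a sketch of propagating a mate's primary info onto a template's extra alignments; the collection names here are hypothetical placeholders:

from fgpyo.sam import set_mate_info_on_secondary, set_mate_info_on_supplementary

# `r1_primary` is the mate's primary alignment; `r2_secondaries` and
# `r2_supplementals` are hypothetical collections of R2's extra alignments.
for rec in r2_secondaries:
    set_mate_info_on_secondary(secondary=rec, mate_primary=r1_primary)
for rec in r2_supplementals:
    set_mate_info_on_supplementary(supp=rec, mate_primary=r1_primary)
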
set_pair_info
set_pair_info(r1: AlignedSegment, r2: AlignedSegment, proper_pair: bool = True) -> None

Resets mate pair information between reads in a pair.

Can be handed reads that already have pairing flags setup or independent R1 and R2 records that are currently flagged as SE reads.

Parameters:

Name Type Description Default
r1 AlignedSegment

Read 1 (first read in the template).

required
r2 AlignedSegment

Read 2 with the same query name as r1 (second read in the template).

required
proper_pair bool

whether the pair is proper or not.

True
Source code in fgpyo/sam/__init__.py
@deprecated("Use `set_mate_info()` instead. Deprecated after fgpyo 0.8.0.")
def set_pair_info(r1: AlignedSegment, r2: AlignedSegment, proper_pair: bool = True) -> None:
    """Resets mate pair information between reads in a pair.

    Can be handed reads that already have pairing flags setup or independent R1 and R2 records that
    are currently flagged as SE reads.

    Args:
        r1: Read 1 (first read in the template).
        r2: Read 2 with the same query name as r1 (second read in the template).
        proper_pair: whether the pair is proper or not.
    """
    if r1.query_name != r2.query_name:
        raise ValueError("Cannot set pair info on reads with different query names!")

    for r in [r1, r2]:
        r.is_paired = True

    r1.is_read1 = True
    r1.is_read2 = False
    r2.is_read2 = True
    r2.is_read1 = False

    set_mate_info(rec1=r1, rec2=r2, is_proper_pair=lambda a, b: proper_pair)
sum_of_base_qualities
sum_of_base_qualities(rec: AlignedSegment, min_quality_score: int = 15) -> int

Calculate the sum of base qualities score for an alignment record.

This function is useful for calculating the "mate score" as implemented in samtools fixmate. Consistently with samtools fixmate, this function returns 0 if the record has no base qualities.

Parameters:

Name Type Description Default
rec AlignedSegment

The alignment record to calculate the sum of base qualities from.

required
min_quality_score int

The minimum base quality score to use for summation.

15

Returns:

Type Description
int

The sum of base qualities on the input record. 0 if the record has no base qualities.

See

calc_sum_of_base_qualities()
MD_MIN_QUALITY

Source code in fgpyo/sam/__init__.py
def sum_of_base_qualities(rec: AlignedSegment, min_quality_score: int = 15) -> int:
    """Calculate the sum of base qualities score for an alignment record.

    This function is useful for calculating the "mate score" as implemented in `samtools fixmate`.
    Consistently with `samtools fixmate`, this function returns 0 if the record has no base
    qualities.

    Args:
        rec: The alignment record to calculate the sum of base qualities from.
        min_quality_score: The minimum base quality score to use for summation.

    Returns:
        The sum of base qualities on the input record. 0 if the record has no base qualities.

    See:
        [`calc_sum_of_base_qualities()`](https://github.com/samtools/samtools/blob/4f3a7397a1f841020074c0048c503a01a52d5fa2/bam_mate.c#L227-L238)
        [`MD_MIN_QUALITY`](https://github.com/samtools/samtools/blob/4f3a7397a1f841020074c0048c503a01a52d5fa2/bam_mate.c#L42)
    """
    if rec.query_qualities is None or rec.query_qualities == NO_QUERY_QUALITIES:
        return 0

    score: int = sum(qual for qual in rec.query_qualities if qual >= min_quality_score)
    return score
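
For example, with five bases at Q20 (the builder defaults are set explicitly here):

>>> from fgpyo.sam import sum_of_base_qualities
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder(r1_len=5, base_quality=20)
>>> rec = builder.add_single()
>>> sum_of_base_qualities(rec)  # five bases at Q20, all above the Q15 minimum
100
>>> sum_of_base_qualities(rec, min_quality_score=30)
0
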
writer
writer(path: SamPath, header: Union[str, Dict[str, Any], AlignmentHeader], file_type: Optional[SamFileType] = None) -> AlignmentFile

Opens a SAM/BAM/CRAM for writing.

To write to standard output, provide any of "-", "stdout", or "/dev/stdout" as the output path. Note: When writing to stdout, the file_type must be given.

Parameters:

Name Type Description Default
path SamPath

a file handle or path to the SAM/BAM/CRAM to read or write.

required
header Union[str, Dict[str, Any], AlignmentHeader]

Either a string to use for the header or a multi-level dictionary. The multi-level dictionary should be given as follows: the first level contains the four record types (‘HD’, ‘SQ’, ...); the second level is a list of lines, each of which is a list of tag-value pairs. The header is constructed first from all the defined fields, followed by user tags in alphabetical order.

required
file_type Optional[SamFileType]

the file type to assume when opening the file. If None, then the file type will be auto-detected from the path, in which case path must be a path-like object. This argument is required when writing to standard output.

None
Source code in fgpyo/sam/__init__.py
def writer(
    path: SamPath,
    header: Union[str, Dict[str, Any], SamHeader],
    file_type: Optional[SamFileType] = None,
) -> SamFile:
    """Opens a SAM/BAM/CRAM for writing.

    To write to standard output, provide any of `"-"`, `"stdout"`, or `"/dev/stdout"` as the output
    `path`. **Note**: When writing to `stdout`, the `file_type` _must_ be given.

    Args:
        path: a file handle or path to the SAM/BAM/CRAM to read or write.
        header: Either a string to use for the header or a multi-level dictionary.  The
            multi-level dictionary should be given as follows: the first level contains the
            four record types (‘HD’, ‘SQ’, ...); the second level is a list of lines, each of
            which is a list of tag-value pairs. The header is constructed first from all the
            defined fields, followed by user tags in alphabetical order.
        file_type: the file type to assume when opening the file.  If `None`, then the
            file type will be auto-detected from the path, in which case `path` must be a
            path-like object. This argument is required when writing to standard output.
    """
    # Set the header for pysam's AlignmentFile
    key = "text" if isinstance(header, str) else "header"
    kwargs = {key: header}

    return _pysam_open(
        path=path, open_for_reading=False, file_type=file_type, unmapped=False, **kwargs
    )
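
For example, a sketch that copies records from one BAM to another, reusing the source header; the paths here are hypothetical:

from fgpyo import sam

with sam.reader("/data/in.bam") as source:
    with sam.writer("/data/out.bam", header=source.header) as sink:
        for rec in source:
            sink.write(rec)
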

Modules

builder
Classes for generating SAM and BAM files and records for testing

This module contains utility classes for the generation of SAM and BAM files and alignment records, for use in testing.

Classes
SamBuilder

Builder for constructing one or more sam records (AlignmentSegments in pysam terms).

Provides the ability to manufacture records from minimal arguments, while generating any remaining attributes to ensure a valid record.

A builder is constructed with a handful of defaults including lengths for generated R1s and R2s, the default base quality score to use, a sequence dictionary and a single read group.

Records are then added using the add_pair() method. Once accumulated the records can be accessed in the order in which they were created through the to_unsorted_list() function, or in a list sorted by coordinate order via to_sorted_list(). The latter creates a temporary file to do the sorting and is somewhat slower as a result. Lastly, the records can be written to a temporary file using to_path().
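
For example, a minimal sketch of the workflow described above (coordinates illustrative):

>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()
>>> r1, r2 = builder.add_pair(chrom="chr1", start1=100, start2=300)
>>> len(builder)
2
>>> [r.query_name for r in builder.to_unsorted_list()]
['q0000', 'q0000']
>>> bam_path = builder.to_path()  # writes a sorted, indexed temporary BAM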

Source code in fgpyo/sam/builder.py
class SamBuilder:
    """Builder for constructing one or more sam records (AlignmentSegments in pysam terms).

    Provides the ability to manufacture records from minimal arguments, while generating
    any remaining attributes to ensure a valid record.

    A builder is constructed with a handful of defaults including lengths for generated R1s
    and R2s, the default base quality score to use, a sequence dictionary and a single read group.

    Records are then added using the [`add_pair()`][fgpyo.sam.builder.SamBuilder.add_pair]
    method.  Once accumulated the records can be accessed in the order in which they were created
    through the [`to_unsorted_list()`][fgpyo.sam.builder.SamBuilder.to_unsorted_list]
    function, or in a list sorted by coordinate order via
    [`to_sorted_list()`][fgpyo.sam.builder.SamBuilder.to_sorted_list].  The latter creates
    a temporary file to do the sorting and is somewhat slower as a result.  Lastly, the records can
    be written to a temporary file using
    [`to_path()`][fgpyo.sam.builder.SamBuilder.to_path].
    """

    # The default read one length
    DEFAULT_R1_LENGTH: int = 100

    # The default read two length
    DEFAULT_R2_LENGTH: int = 100

    @staticmethod
    def default_sd() -> List[Dict[str, Any]]:
        """Generates the sequence dictionary that is used by default by SamBuilder.

        Matches the names and lengths of the HG19 reference in use in production.

        Returns:
            A new copy of the sequence dictionary as a list of dictionaries, one per chromosome.
        """
        return [
            {"SN": "chr1", "LN": 249250621},
            {"SN": "chr2", "LN": 243199373},
            {"SN": "chr3", "LN": 198022430},
            {"SN": "chr4", "LN": 191154276},
            {"SN": "chr5", "LN": 180915260},
            {"SN": "chr6", "LN": 171115067},
            {"SN": "chr7", "LN": 159138663},
            {"SN": "chr8", "LN": 146364022},
            {"SN": "chr9", "LN": 141213431},
            {"SN": "chr10", "LN": 135534747},
            {"SN": "chr11", "LN": 135006516},
            {"SN": "chr12", "LN": 133851895},
            {"SN": "chr13", "LN": 115169878},
            {"SN": "chr14", "LN": 107349540},
            {"SN": "chr15", "LN": 102531392},
            {"SN": "chr16", "LN": 90354753},
            {"SN": "chr17", "LN": 81195210},
            {"SN": "chr18", "LN": 78077248},
            {"SN": "chr19", "LN": 59128983},
            {"SN": "chr20", "LN": 63025520},
            {"SN": "chr21", "LN": 48129895},
            {"SN": "chr22", "LN": 51304566},
            {"SN": "chrX", "LN": 155270560},
            {"SN": "chrY", "LN": 59373566},
            {"SN": "chrM", "LN": 16571},
        ]

    @staticmethod
    def default_rg() -> Dict[str, str]:
        """Returns the default read group used by the SamBuilder, as a dictionary."""
        return {"ID": "1", "SM": "1_AAAAAA", "LB": "default", "PL": "ILLUMINA", "PU": "xxx.1"}

    def __init__(
        self,
        r1_len: Optional[int] = None,
        r2_len: Optional[int] = None,
        base_quality: int = 30,
        mapping_quality: int = 60,
        sd: Optional[List[Dict[str, Any]]] = None,
        rg: Optional[Dict[str, str]] = None,
        extra_header: Optional[Dict[str, Any]] = None,
        seed: int = 42,
        sort_order: SamOrder = SamOrder.Coordinate,
    ) -> None:
        """Initializes a new SamBuilder for generating alignment records and SAM/BAM files.

        Args:
            r1_len: The length of R1s to create unless otherwise specified
            r2_len: The length of R2s to create unless otherwise specified
            base_quality: The base quality of bases to create unless otherwise specified
            mapping_quality: The mapping quality to use for mapped records unless otherwise specified
            sd: a sequence dictionary as a list of dicts; defaults to calling default_sd() if None
            rg: a single read group as a dict; defaults to calling default_rg() if None
            extra_header: a dictionary of extra values to add to the header, None otherwise.  See
                          `pysam.AlignmentHeader` for more details.
            seed: a seed value for random number/string generation
            sort_order: Order to sort records when writing to file, or output of to_sorted_list()
        """

        self.r1_len: int = r1_len if r1_len is not None else self.DEFAULT_R1_LENGTH
        self.r2_len: int = r2_len if r2_len is not None else self.DEFAULT_R2_LENGTH
        self.base_quality: int = base_quality
        self.mapping_quality: int = mapping_quality

        if not isinstance(sort_order, SamOrder):
            raise ValueError(f"sort_order must be a SamOrder, got {type(sort_order)}")
        self._sort_order = sort_order

        self._header: Dict[str, Any] = {
            "HD": {"VN": "1.5", "SO": sort_order.value},
            "SQ": (sd if sd is not None else SamBuilder.default_sd()),
            "RG": [(rg if rg is not None else SamBuilder.default_rg())],
        }
        if extra_header is not None:
            self._header = {**self._header, **extra_header}
        self._samheader = AlignmentHeader.from_dict(self._header)
        self._seq_lookup = dict([(s["SN"], s) for s in self._header["SQ"]])

        self._random: Random = Random(seed)
        self._records: List[AlignedSegment] = []
        self._counter: int = 0

    def _next_name(self) -> str:
        """Returns the next available query/template name."""
        n = self._counter
        self._counter += 1
        return f"q{n:>04}"

    def _bases(self, length: int) -> str:
        """Returns a random string of bases of the length requested."""
        return "".join(self._random.choices("ACGT", k=length))

    def _new_rec(
        self,
        name: str,
        chrom: str,
        start: int,
        mapq: Optional[int],
        attrs: Optional[Dict[str, Any]],
    ) -> AlignedSegment:
        """Generates a new AlignedSegment.  Sets the segment up with the correct
        header and adds the RG attribute if not contained in attrs.

        Args:
            name: the name of the read/template
            chrom: the chromosome to which the read is mapped
            start: the start position of the read on the chromosome
            mapq: an optional mapping quality; use self.mapping_quality if None
            attrs: an optional dictionary of SAM attributes with two-char keys

        Returns:
            AlignedSegment: an aligned segment with name, chrom, pos, attributes the
                read group, and the unmapped flag all set appropriately.
        """
        if chrom is not sam.NO_REF_NAME and chrom not in self._seq_lookup:
            raise ValueError(f"{chrom} is not a valid chromosome name in this builder.")

        rec = AlignedSegment(header=self._samheader)
        rec.query_name = name
        rec.reference_name = chrom
        rec.reference_start = start
        rec.mapping_quality = mapq if mapq is not None else self.mapping_quality

        if chrom == sam.NO_REF_NAME or start == sam.NO_REF_POS:
            rec.is_unmapped = True
            rec.mapping_quality = 0

        attrs = attrs if attrs else dict()
        if "RG" not in attrs:
            attrs["RG"] = self.rg_id()
        rec.set_tags(list(attrs.items()))
        return rec

    def _set_flags(
        self,
        rec: pysam.AlignedSegment,
        read_num: Optional[int],
        strand: str,
        secondary: bool = False,
        supplementary: bool = False,
    ) -> None:
        """Appropriately sets most flag fields on the given read.

        Args:
            rec: the read to set the flags on
            read_num: Either None for an unpaired read, or 1 or 2
            strand: Either "+" or "-" to indicate strand of the read
        """
        rec.is_paired = read_num is not None
        rec.is_read1 = read_num == 1
        rec.is_read2 = read_num == 2
        rec.is_qcfail = False
        rec.is_duplicate = False
        rec.is_secondary = secondary
        rec.is_supplementary = supplementary
        if not rec.is_unmapped:
            rec.is_reverse = strand != "+"

    def _set_length_dependent_fields(
        self,
        rec: pysam.AlignedSegment,
        length: int,
        bases: Optional[str] = None,
        quals: Optional[List[int]] = None,
        cigar: Optional[str] = None,
    ) -> None:
        """Fills in bases, quals and cigar on a record.

        If any of bases, quals or cigar are defined, they must all have the same length/query
        length.  If none are defined then the length parameter is used.  Undefined values are
        synthesized at the inferred length.

        Args:
            rec: a SAM record
            length: the length to use if all of bases/quals/cigar are None
            bases: an optional string of bases for the read
            quals: an optional list of qualities for the read
            cigar: an optional cigar string for the read
        """

        # Do some validation to make sure all defined things have the same lengths
        lengths = set()
        if bases is not None:
            lengths.add(len(bases))
        if quals is not None:
            lengths.add(len(quals))
        if cigar is not None:
            cig = sam.Cigar.from_cigarstring(cigar)
            lengths.add(sum([elem.length_on_query for elem in cig.elements]))

        if not lengths:
            lengths.add(length)

        if len(lengths) != 1:
            raise ValueError("Provided bases/quals/cigar are not length compatible.")

        # Fill in the record, making any parts that were not defined as params
        length = lengths.pop()
        query_quals = array("B", quals if quals else [self.base_quality] * length)
        rec.query_sequence = bases if bases else self._bases(length)
        rec.query_qualities = query_quals
        if not rec.is_unmapped:
            rec.cigarstring = cigar if cigar else f"{length}M"

    def rg(self) -> Dict[str, Any]:
        """Returns the single read group that is defined in the header."""
        # The `RG` field contains a list of read group mappings
        # e.g. `[{"ID": "rg1", "PL": "ILLUMINA"}]`
        rgs = cast(List[Dict[str, Any]], self._header["RG"])
        assert len(rgs) == 1, "Header did not contain exactly one read group!"
        return rgs[0]

    def rg_id(self) -> str:
        """Returns the ID of the single read group that is defined in the header."""
        # The read group mapping has mixed types of values (e.g. "PI" is numeric), but the "ID"
        # field is always a string.
        return cast(str, self.rg()["ID"])

    def add_pair(
        self,
        *,
        name: Optional[str] = None,
        bases1: Optional[str] = None,
        bases2: Optional[str] = None,
        quals1: Optional[List[int]] = None,
        quals2: Optional[List[int]] = None,
        chrom: Optional[str] = None,
        chrom1: Optional[str] = None,
        chrom2: Optional[str] = None,
        start1: int = sam.NO_REF_POS,
        start2: int = sam.NO_REF_POS,
        cigar1: Optional[str] = None,
        cigar2: Optional[str] = None,
        mapq1: Optional[int] = None,
        mapq2: Optional[int] = None,
        strand1: str = "+",
        strand2: str = "-",
        attrs: Optional[Dict[str, Any]] = None,
    ) -> Tuple[AlignedSegment, AlignedSegment]:
        """Generates a new pair of reads, adds them to the internal collection, and returns them.

        Most fields are optional.

        Mapped pairs can be created by specifying both `start1` and `start2` and either `chrom`, for
        pairs where both reads map to the same contig, or both `chrom1` and `chrom2`, for pairs
        where reads map to different contigs. i.e.:

            - `add_pair(chrom, start1, start2)` will create a mapped pair where both reads map to
              the same contig (`chrom`).
            - `add_pair(chrom1, start1, chrom2, start2)` will create a mapped pair where the reads
              map to different contigs (`chrom1` and `chrom2`).

        A pair with only one of the two reads mapped can be created by setting only one start
        position. Flags will automatically be set correctly for the unmapped mate.

            - `add_pair(chrom, start1)`
            - `add_pair(chrom1, start1)`
            - `add_pair(chrom, start2)`
            - `add_pair(chrom2, start2)`

        An unmapped pair can be created by calling the method with no parameters (specifically,
        not setting `chrom`, `chrom1`, `start1`, `chrom2`, or `start2`). If either cigar is
        provided, it will be ignored.

        For a given read (i.e. R1 or R2) the length of the read is determined based on the presence
        or absence of bases, quals, and cigar.  If values are provided for one or more of these
        parameters, the lengths must match, and the length will be used to generate any
        unsupplied values.  If none of bases, quals, and cigar are provided, all three will be
        synthesized based on either the r1_len or r2_len stored on the class as appropriate.

        When synthesizing, bases are always a random sequence of bases, quals are all the default
        base quality (supplied when constructing a SamBuilder) and the cigar is always a single M
        operator of the read length.

        Args:
            name: The name of the template. If None is given a unique name will be auto-generated.
            bases1: The bases for R1. If None is given a random sequence is generated.
            bases2: The bases for R2. If None is given a random sequence is generated.
            quals1: The list of int qualities for R1. If None, the default base quality is used.
            quals2: The list of int qualities for R2. If None, the default base quality is used.
            chrom: The chromosome to which both reads are mapped. Defaults to the unmapped value.
            chrom1: The chromosome to which R1 is mapped. If None, `chrom` is used.
            chrom2: The chromosome to which R2 is mapped. If None, `chrom` is used.
            start1: The start position of R1. Defaults to the unmapped value.
            start2: The start position of R2. Defaults to the unmapped value.
            cigar1: The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.
            cigar2: The cigar string for R2. Defaults to None for unmapped reads, otherwise all M.
            mapq1: Mapping quality for R1. Defaults to self.mapping_quality if None.
            mapq2: Mapping quality for R2. Defaults to self.mapping_quality if None.
            strand1: The strand for R1, either "+" or "-". Defaults to "+".
            strand2: The strand for R2, either "+" or "-". Defaults to "-".
            attrs: An optional dictionary of SAM attributes to place on both R1 and R2.

        Raises:
            ValueError: if either strand field is not "+" or "-"
            ValueError: if bases/quals/cigar are set in a way that is not self-consistent

        Returns:
            Tuple[AlignedSegment, AlignedSegment]: The pair of records created, R1 then R2.
        """

        if strand1 not in ["+", "-"]:
            raise ValueError(f"Invalid value for strand1: {strand1}")
        if strand2 not in ["+", "-"]:
            raise ValueError(f"Invalid value for strand2: {strand2}")

        name = name if name is not None else self._next_name()

        # Valid parameterizations for contig mapping (backward compatible):
        # - chrom, start1, start2
        # - chrom, start1
        # - chrom, start2
        # Valid parameterizations for contig mapping (new):
        # - chrom1, start1, chrom2, start2
        # - chrom1, start1
        # - chrom2, start2
        if chrom is not None and (chrom1 is not None or chrom2 is not None):
            raise ValueError("Cannot use chrom in combination with chrom1 or chrom2")

        chrom = sam.NO_REF_NAME if chrom is None else chrom

        if start1 != sam.NO_REF_POS:
            chrom1 = next(c for c in (chrom1, chrom) if c is not None)
        else:
            chrom1 = sam.NO_REF_NAME

        if start2 != sam.NO_REF_POS:
            chrom2 = next(c for c in (chrom2, chrom) if c is not None)
        else:
            chrom2 = sam.NO_REF_NAME

        if chrom1 == sam.NO_REF_NAME and start1 != sam.NO_REF_POS:
            raise ValueError("start1 cannot be used on its own - specify chrom or chrom1")

        if chrom2 == sam.NO_REF_NAME and start2 != sam.NO_REF_POS:
            raise ValueError("start2 cannot be used on its own - specify chrom or chrom2")

        # Setup R1
        r1 = self._new_rec(name=name, chrom=chrom1, start=start1, mapq=mapq1, attrs=attrs)
        self._set_flags(r1, read_num=1, strand=strand1)
        self._set_length_dependent_fields(
            rec=r1, length=self.r1_len, bases=bases1, quals=quals1, cigar=cigar1
        )

        # Setup R2
        r2 = self._new_rec(name=name, chrom=chrom2, start=start2, mapq=mapq2, attrs=attrs)
        self._set_flags(r2, read_num=2, strand=strand2)
        self._set_length_dependent_fields(
            rec=r2, length=self.r2_len, bases=bases2, quals=quals2, cigar=cigar2
        )

        # Sync up mate info and we're done!
        sam.set_mate_info(r1, r2)
        self._records.append(r1)
        self._records.append(r2)
        return r1, r2

    def add_single(
        self,
        *,
        name: Optional[str] = None,
        read_num: Optional[int] = None,
        bases: Optional[str] = None,
        quals: Optional[List[int]] = None,
        chrom: str = sam.NO_REF_NAME,
        start: int = sam.NO_REF_POS,
        cigar: Optional[str] = None,
        mapq: Optional[int] = None,
        strand: str = "+",
        secondary: bool = False,
        supplementary: bool = False,
        attrs: Optional[Dict[str, Any]] = None,
    ) -> AlignedSegment:
        """Generates a new single reads, adds them to the internal collection, and returns it.

        Most fields are optional.

        If `read_num` is None (the default) an unpaired read will be created.  If `read_num` is
        set to 1 or 2, the read will have its paired and read number flags set.

        An unmapped read can be created by calling the method with no parameters (specifically,
        not setting chrom or start).  If cigar is provided, it will be ignored.

        A mapped read is created by providing chrom and start.

        The length of the read is determined based on the presence or absence of bases, quals,
        and cigar.  If values are provided for one or more of these parameters, the lengths must
        match, and the length will be used to generate any unsupplied values.  If none of bases,
        quals, and cigar are provided, all three will be synthesized based on either the r1_len
        or r2_len stored on the class as appropriate.

        When synthesizing, bases are always a random sequence of bases, quals are all the default
        base quality (supplied when constructing a SamBuilder) and the cigar is always a single M
        operator of the read length.

        Args:
            name: The name of the template. If None is given a unique name will be auto-generated.
            read_num: Either None, 1 for R1 or 2 for R2
            bases: The bases for the read. If None is given a random sequence is generated.
            quals: The list of qualities for the read. If None, the default base quality is used.
            chrom: The chromosome to which the read is mapped. Defaults to the unmapped value.
            start: The start position of the read. Defaults to the unmapped value.
            cigar: The cigar string for the read. Defaults to None for unmapped reads, otherwise all M.
            mapq: Mapping quality for the read. Defaults to self.mapping_quality if not given.
            strand: The strand for the read, either "+" or "-". Defaults to "+".
            secondary: If true the read will be flagged as secondary
            supplementary: If true the read will be flagged as supplementary
            attrs: An optional dictionary of SAM attributes to place on the read.

        Raises:
            ValueError: if strand field is not "+" or "-"
            ValueError: if read_num is not None, 1 or 2
            ValueError: if bases/quals/cigar are set in a way that is not self-consistent

        Returns:
            AlignedSegment: The record created
        """

        if strand not in ["+", "-"]:
            raise ValueError(f"Invalid value for strand1: {strand}")
        if read_num not in [None, 1, 2]:
            raise ValueError(f"Invalid value for read_num: {read_num}")

        name = name if name is not None else self._next_name()

        # Setup the read
        read_len = self.r1_len if read_num != 2 else self.r2_len
        rec = self._new_rec(name=name, chrom=chrom, start=start, mapq=mapq, attrs=attrs)
        self._set_flags(
            rec, read_num=read_num, strand=strand, secondary=secondary, supplementary=supplementary
        )
        self._set_length_dependent_fields(
            rec=rec, length=read_len, bases=bases, quals=quals, cigar=cigar
        )

        self._records.append(rec)
        return rec

    def to_path(  # noqa: C901
        self,
        path: Optional[Path] = None,
        index: bool = True,
        pred: Callable[[AlignedSegment], bool] = lambda r: True,
        tmp_file_type: Optional[sam.SamFileType] = None,
    ) -> Path:
        """Write the accumulated records to a file, sorts & indexes it, and returns the Path.
        If a path is provided, it will be written to, otherwise a temporary file is created
        and returned.

        If `path` is provided, `tmp_file_type` may not be provided; in that case the file type
        (SAM/BAM/CRAM) will be determined automatically from the file extension.  See `~pysam`
        for more details.

        If `path` is not provided, the file type will default to BAM unless `tmp_file_type` is
        provided.

        Args:
            path: a path at which to write the file, otherwise a temp file is used.
            index: if True and `sort_order` is `Coordinate` and output is a BAM/CRAM file, then
                   an index is generated, otherwise not.
            pred: optional predicate to specify which reads should be output
            tmp_file_type: the file type to output when a path is not provided (default is BAM)

        Returns:
            Path: The path to the sorted (and possibly indexed) file.
        """
        if path is not None:
            # Get the file type if a path was given (in this case, a file type may not be
            # provided too)
            if tmp_file_type is not None:
                raise ValueError("Both `path` and `tmp_file_type` cannot be provided.")
            tmp_file_type = sam.SamFileType.from_path(path)
        elif tmp_file_type is None:
            # No path or file type was given, so default the temporary file type to BAM
            tmp_file_type = sam.SamFileType.BAM

        # Get the extension, and create a path if none was given
        ext = tmp_file_type.extension
        if path is None:
            with NamedTemporaryFile(suffix=ext, delete=False) as fp:
                path = Path(fp.name)

        with NamedTemporaryFile(suffix=ext, delete=True) as fp:
            file_handle: IO
            if self._sort_order in {SamOrder.Unsorted, SamOrder.Unknown}:
                file_handle = path.open("w")
            else:
                file_handle = fp.file

            with sam.writer(file_handle, header=self._samheader, file_type=tmp_file_type) as writer:
                for rec in self._records:
                    if pred(rec):
                        writer.write(rec)

            samtools_sort_args = ["-o", str(path), fp.name]

            file_handle.close()
            if self._sort_order == SamOrder.QueryName:
                pysam.sort("-n", *samtools_sort_args)
            elif self._sort_order == SamOrder.Coordinate:
                if index and tmp_file_type.indexable:
                    samtools_sort_args.insert(0, "--write-index")
                pysam.sort(*samtools_sort_args)

        return path

    def __len__(self) -> int:
        """Returns the number of records accumulated so far."""
        return len(self._records)

    def to_unsorted_list(self) -> List[pysam.AlignedSegment]:
        """Returns the accumulated records in the order they were created."""
        return list(self._records)

    def to_sorted_list(self) -> List[pysam.AlignedSegment]:
        """Returns the accumulated records in coordinate order."""
        with NamedTemporaryFile(suffix=".bam", delete=True) as fp:
            filename = fp.name
            path = self.to_path(path=Path(filename), index=False)
            bam = sam.reader(path)
            return list(bam)

    @property
    def header(self) -> AlignmentHeader:
        """Returns the builder's SAM header."""
        return self._samheader
Attributes
header property
header: AlignmentHeader

Returns the builder's SAM header.

Functions
__init__
__init__(r1_len: Optional[int] = None, r2_len: Optional[int] = None, base_quality: int = 30, mapping_quality: int = 60, sd: Optional[List[Dict[str, Any]]] = None, rg: Optional[Dict[str, str]] = None, extra_header: Optional[Dict[str, Any]] = None, seed: int = 42, sort_order: SamOrder = Coordinate) -> None

Initializes a new SamBuilder for generating alignment records and SAM/BAM files.

Parameters:

Name Type Description Default
r1_len Optional[int]

The length of R1s to create unless otherwise specified

None
r2_len Optional[int]

The length of R2s to create unless otherwise specified

None
base_quality int

The base quality of bases to create unless otherwise specified

30
mapping_quality int

The mapping quality to use for mapped records unless otherwise specified

60
sd Optional[List[Dict[str, Any]]]

a sequence dictionary as a list of dicts; defaults to calling default_sd() if None

None
rg Optional[Dict[str, str]]

a single read group as a dict; defaults to calling default_rg() if None

None
extra_header Optional[Dict[str, Any]]

a dictionary of extra values to add to the header, None otherwise. See pysam.AlignmentHeader for more details.

None
seed int

a seed value for random number/string generation

42
sort_order SamOrder

Order to sort records when writing to file, or output of to_sorted_list()

Coordinate
Source code in fgpyo/sam/builder.py
def __init__(
    self,
    r1_len: Optional[int] = None,
    r2_len: Optional[int] = None,
    base_quality: int = 30,
    mapping_quality: int = 60,
    sd: Optional[List[Dict[str, Any]]] = None,
    rg: Optional[Dict[str, str]] = None,
    extra_header: Optional[Dict[str, Any]] = None,
    seed: int = 42,
    sort_order: SamOrder = SamOrder.Coordinate,
) -> None:
    """Initializes a new SamBuilder for generating alignment records and SAM/BAM files.

    Args:
        r1_len: The length of R1s to create unless otherwise specified
        r2_len: The length of R2s to create unless otherwise specified
        base_quality: The base quality of bases to create unless otherwise specified
        mapping_quality: The mapping quality to use for mapped records unless otherwise specified
        sd: a sequence dictionary as a list of dicts; defaults to calling default_sd() if None
        rg: a single read group as a dict; defaults to calling default_rg() if None
        extra_header: a dictionary of extra values to add to the header, None otherwise.  See
                      `pysam.AlignmentHeader` for more details.
        seed: a seed value for random number/string generation
        sort_order: Order to sort records when writing to file, or output of to_sorted_list()
    """

    self.r1_len: int = r1_len if r1_len is not None else self.DEFAULT_R1_LENGTH
    self.r2_len: int = r2_len if r2_len is not None else self.DEFAULT_R2_LENGTH
    self.base_quality: int = base_quality
    self.mapping_quality: int = mapping_quality

    if not isinstance(sort_order, SamOrder):
        raise ValueError(f"sort_order must be a SamOrder, got {type(sort_order)}")
    self._sort_order = sort_order

    self._header: Dict[str, Any] = {
        "HD": {"VN": "1.5", "SO": sort_order.value},
        "SQ": (sd if sd is not None else SamBuilder.default_sd()),
        "RG": [(rg if rg is not None else SamBuilder.default_rg())],
    }
    if extra_header is not None:
        self._header = {**self._header, **extra_header}
    self._samheader = AlignmentHeader.from_dict(self._header)
    self._seq_lookup = dict([(s["SN"], s) for s in self._header["SQ"]])

    self._random: Random = Random(seed)
    self._records: List[AlignedSegment] = []
    self._counter: int = 0
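
For example, a sketch of a customized builder (the values here are illustrative):

>>> from fgpyo.sam import SamOrder
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder(r1_len=50, base_quality=20, sort_order=SamOrder.QueryName)
>>> rec = builder.add_single(chrom="chr1", start=100)
>>> len(rec.query_sequence)
50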
__len__
__len__() -> int

Returns the number of records accumulated so far.

Source code in fgpyo/sam/builder.py
def __len__(self) -> int:
    """Returns the number of records accumulated so far."""
    return len(self._records)
add_pair
add_pair(*, name: Optional[str] = None, bases1: Optional[str] = None, bases2: Optional[str] = None, quals1: Optional[List[int]] = None, quals2: Optional[List[int]] = None, chrom: Optional[str] = None, chrom1: Optional[str] = None, chrom2: Optional[str] = None, start1: int = NO_REF_POS, start2: int = NO_REF_POS, cigar1: Optional[str] = None, cigar2: Optional[str] = None, mapq1: Optional[int] = None, mapq2: Optional[int] = None, strand1: str = '+', strand2: str = '-', attrs: Optional[Dict[str, Any]] = None) -> Tuple[AlignedSegment, AlignedSegment]

Generates a new pair of reads, adds them to the internal collection, and returns them.

Most fields are optional.

Mapped pairs can be created by specifying both start1 and start2 and either chrom, for pairs where both reads map to the same contig, or both chrom1 and chrom2, for pairs where reads map to different contigs. i.e.:

- `add_pair(chrom, start1, start2)` will create a mapped pair where both reads map to
  the same contig (`chrom`).
- `add_pair(chrom1, start1, chrom2, start2)` will create a mapped pair where the reads
  map to different contigs (`chrom1` and `chrom2`).

A pair with only one of the two reads mapped can be created by setting only one start position. Flags will automatically be set correctly for the unmapped mate.

- `add_pair(chrom, start1)`
- `add_pair(chrom1, start1)`
- `add_pair(chrom, start2)`
- `add_pair(chrom2, start2)`

An unmapped pair can be created by calling the method with no parameters (specifically, not setting chrom, chrom1, start1, chrom2, or start2). If either cigar is provided, it will be ignored.

For a given read (i.e. R1 or R2) the length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.

When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.

Parameters:

Name Type Description Default
name Optional[str]

The name of the template. If None is given a unique name will be auto-generated.

None
bases1 Optional[str]

The bases for R1. If None is given a random sequence is generated.

None
bases2 Optional[str]

The bases for R2. If None is given a random sequence is generated.

None
quals1 Optional[List[int]]

The list of int qualities for R1. If None, the default base quality is used.

None
quals2 Optional[List[int]]

The list of int qualities for R2. If None, the default base quality is used.

None
chrom Optional[str]

The chromosome to which both reads are mapped. Defaults to the unmapped value.

None
chrom1 Optional[str]

The chromosome to which R1 is mapped. If None, chrom is used.

None
chrom2 Optional[str]

The chromosome to which R2 is mapped. If None, chrom is used.

None
start1 int

The start position of R1. Defaults to the unmapped value.

NO_REF_POS
start2 int

The start position of R2. Defaults to the unmapped value.

NO_REF_POS
cigar1 Optional[str]

The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.

None
cigar2 Optional[str]

The cigar string for R2. Defaults to None for unmapped reads, otherwise all M.

None
mapq1 Optional[int]

Mapping quality for R1. Defaults to self.mapping_quality if None.

None
mapq2 Optional[int]

Mapping quality for R2. Defaults to self.mapping_quality if None.

None
strand1 str

The strand for R1, either "+" or "-". Defaults to "+".

'+'
strand2 str

The strand for R2, either "+" or "-". Defaults to "-".

'-'
attrs Optional[Dict[str, Any]]

An optional dictionary of SAM attributes to place on both R1 and R2.

None

Raises:

Type Description
ValueError

if either strand field is not "+" or "-"

ValueError

if bases/quals/cigar are set in a way that is not self-consistent

Returns:

Type Description
Tuple[AlignedSegment, AlignedSegment]

Tuple[AlignedSegment, AlignedSegment]: The pair of records created, R1 then R2.

Source code in fgpyo/sam/builder.py
def add_pair(
    self,
    *,
    name: Optional[str] = None,
    bases1: Optional[str] = None,
    bases2: Optional[str] = None,
    quals1: Optional[List[int]] = None,
    quals2: Optional[List[int]] = None,
    chrom: Optional[str] = None,
    chrom1: Optional[str] = None,
    chrom2: Optional[str] = None,
    start1: int = sam.NO_REF_POS,
    start2: int = sam.NO_REF_POS,
    cigar1: Optional[str] = None,
    cigar2: Optional[str] = None,
    mapq1: Optional[int] = None,
    mapq2: Optional[int] = None,
    strand1: str = "+",
    strand2: str = "-",
    attrs: Optional[Dict[str, Any]] = None,
) -> Tuple[AlignedSegment, AlignedSegment]:
    """Generates a new pair of reads, adds them to the internal collection, and returns them.

    Most fields are optional.

    Mapped pairs can be created by specifying both `start1` and `start2` and either `chrom`, for
    pairs where both reads map to the same contig, or both `chrom1` and `chrom2`, for pairs
    where reads map to different contigs. i.e.:

        - `add_pair(chrom, start1, start2)` will create a mapped pair where both reads map to
          the same contig (`chrom`).
        - `add_pair(chrom1, start1, chrom2, start2)` will create a mapped pair where the reads
          map to different contigs (`chrom1` and `chrom2`).

    A pair with only one of the two reads mapped can be created by setting only one start
    position. Flags will automatically be set correctly for the unmapped mate.

        - `add_pair(chrom, start1)`
        - `add_pair(chrom1, start1)`
        - `add_pair(chrom, start2)`
        - `add_pair(chrom2, start2)`

    An unmapped pair can be created by calling the method with no parameters (specifically,
    not setting `chrom`, `chrom1`, `start1`, `chrom2`, or `start2`). If either cigar is
    provided, it will be ignored.

    For a given read (i.e. R1 or R2) the length of the read is determined based on the presence
    or absence of bases, quals, and cigar.  If values are provided for one or more of these
    parameters, the lengths must match, and the length will be used to generate any
    unsupplied values.  If none of bases, quals, and cigar are provided, all three will be
    synthesized based on either the r1_len or r2_len stored on the class as appropriate.

    When synthesizing, bases are always a random sequence of bases, quals are all the default
    base quality (supplied when constructing a SamBuilder) and the cigar is always a single M
    operator of the read length.

    Args:
        name: The name of the template. If None is given a unique name will be auto-generated.
        bases1: The bases for R1. If None is given a random sequence is generated.
        bases2: The bases for R2. If None is given a random sequence is generated.
        quals1: The list of int qualities for R1. If None, the default base quality is used.
        quals2: The list of int qualities for R2. If None, the default base quality is used.
        chrom: The chromosome to which both reads are mapped. Defaults to the unmapped value.
        chrom1: The chromosome to which R1 is mapped. If None, `chrom` is used.
        chrom2: The chromosome to which R2 is mapped. If None, `chrom` is used.
        start1: The start position of R1. Defaults to the unmapped value.
        start2: The start position of R2. Defaults to the unmapped value.
        cigar1: The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.
        cigar2: The cigar string for R2. Defaults to None for unmapped reads, otherwise all M.
        mapq1: Mapping quality for R1. Defaults to self.mapping_quality if None.
        mapq2: Mapping quality for R2. Defaults to self.mapping_quality if None.
        strand1: The strand for R1, either "+" or "-". Defaults to "+".
        strand2: The strand for R2, either "+" or "-". Defaults to "-".
        attrs: An optional dictionary of SAM attributes to place on both R1 and R2.

    Raises:
        ValueError: if either strand field is not "+" or "-"
        ValueError: if bases/quals/cigar are set in a way that is not self-consistent

    Returns:
        Tuple[AlignedSegment, AlignedSegment]: The pair of records created, R1 then R2.
    """

    if strand1 not in ["+", "-"]:
        raise ValueError(f"Invalid value for strand1: {strand1}")
    if strand2 not in ["+", "-"]:
        raise ValueError(f"Invalid value for strand2: {strand2}")

    name = name if name is not None else self._next_name()

    # Valid parameterizations for contig mapping (backward compatible):
    # - chrom, start1, start2
    # - chrom, start1
    # - chrom, start2
    # Valid parameterizations for contig mapping (new):
    # - chrom1, start1, chrom2, start2
    # - chrom1, start1
    # - chrom2, start2
    if chrom is not None and (chrom1 is not None or chrom2 is not None):
        raise ValueError("Cannot use chrom in combination with chrom1 or chrom2")

    chrom = sam.NO_REF_NAME if chrom is None else chrom

    if start1 != sam.NO_REF_POS:
        chrom1 = next(c for c in (chrom1, chrom) if c is not None)
    else:
        chrom1 = sam.NO_REF_NAME

    if start2 != sam.NO_REF_POS:
        chrom2 = next(c for c in (chrom2, chrom) if c is not None)
    else:
        chrom2 = sam.NO_REF_NAME

    if chrom1 == sam.NO_REF_NAME and start1 != sam.NO_REF_POS:
        raise ValueError("start1 cannot be used on its own - specify chrom or chrom1")

    if chrom2 == sam.NO_REF_NAME and start2 != sam.NO_REF_POS:
        raise ValueError("start2 cannot be used on its own - specify chrom or chrom2")

    # Setup R1
    r1 = self._new_rec(name=name, chrom=chrom1, start=start1, mapq=mapq1, attrs=attrs)
    self._set_flags(r1, read_num=1, strand=strand1)
    self._set_length_dependent_fields(
        rec=r1, length=self.r1_len, bases=bases1, quals=quals1, cigar=cigar1
    )

    # Setup R2
    r2 = self._new_rec(name=name, chrom=chrom2, start=start2, mapq=mapq2, attrs=attrs)
    self._set_flags(r2, read_num=2, strand=strand2)
    self._set_length_dependent_fields(
        rec=r2, length=self.r2_len, bases=bases2, quals=quals2, cigar=cigar2
    )

    # Sync up mate info and we're done!
    sam.set_mate_info(r1, r2)
    self._records.append(r1)
    self._records.append(r2)
    return r1, r2
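
For example, a minimal sketch of building a mapped pair (using the builder's default read length and sequence dictionary):

>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()
>>> r1, r2 = builder.add_pair(chrom="chr1", start1=100, start2=200)
>>> r1.reference_start
100
>>> r2.is_reverse  # strand2 defaults to "-"
True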
add_single
add_single(*, name: Optional[str] = None, read_num: Optional[int] = None, bases: Optional[str] = None, quals: Optional[List[int]] = None, chrom: str = NO_REF_NAME, start: int = NO_REF_POS, cigar: Optional[str] = None, mapq: Optional[int] = None, strand: str = '+', secondary: bool = False, supplementary: bool = False, attrs: Optional[Dict[str, Any]] = None) -> AlignedSegment

Generates a new single read, adds it to the internal collection, and returns it.

Most fields are optional.

If read_num is None (the default) an unpaired read will be created. If read_num is set to 1 or 2, the read will have its paired flag and read number flags set.

An unmapped read can be created by calling the method with no parameters (specifically, not setting chrom or start). If cigar is provided, it will be ignored.

A mapped read is created by providing chrom and start.

The length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.

When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.

Parameters:

Name Type Description Default
name Optional[str]

The name of the template. If None is given a unique name will be auto-generated.

None
read_num Optional[int]

Either None, 1 for R1 or 2 for R2

None
bases Optional[str]

The bases for the read. If None is given a random sequence is generated.

None
quals Optional[List[int]]

The list of qualities for the read. If None, the default base quality is used.

None
chrom str

The chromosome to which the read is mapped. Defaults to the unmapped value.

NO_REF_NAME
start int

The start position of the read. Defaults to the unmapped value.

NO_REF_POS
cigar Optional[str]

The cigar string for the read. Defaults to None for unmapped reads, otherwise all M.

None
mapq Optional[int]

Mapping quality for the read. Defaults to self.mapping_quality if not given.

None
strand str

The strand for the read, either "+" or "-". Defaults to "+".

'+'
secondary bool

If true the read will be flagged as secondary

False
supplementary bool

If true the read will be flagged as supplementary

False
attrs Optional[Dict[str, Any]]

An optional dictionary of SAM attributes to place on the read.

None

Raises:

Type Description
ValueError

if strand field is not "+" or "-"

ValueError

if read_num is not None, 1 or 2

ValueError

if bases/quals/cigar are set in a way that is not self-consistent

Returns:

Name Type Description
AlignedSegment AlignedSegment

The record created

Source code in fgpyo/sam/builder.py
def add_single(
    self,
    *,
    name: Optional[str] = None,
    read_num: Optional[int] = None,
    bases: Optional[str] = None,
    quals: Optional[List[int]] = None,
    chrom: str = sam.NO_REF_NAME,
    start: int = sam.NO_REF_POS,
    cigar: Optional[str] = None,
    mapq: Optional[int] = None,
    strand: str = "+",
    secondary: bool = False,
    supplementary: bool = False,
    attrs: Optional[Dict[str, Any]] = None,
) -> AlignedSegment:
    """Generates a new single reads, adds them to the internal collection, and returns it.

    Most fields are optional.

    If `read_num` is None (the default) an unpaired read will be created.  If `read_num` is
    set to 1 or 2, the read will have its paired flag and read number flags set.

    An unmapped read can be created by calling the method with no parameters (specifically,
    not setting chrom or start).  If cigar is provided, it will be ignored.

    A mapped read is created by providing chrom and start.

    The length of the read is determined based on the presence or absence of bases, quals,
    and cigar.  If values are provided for one or more of these parameters, the lengths must
    match, and the length will be used to generate any unsupplied values.  If none of bases,
    quals, and cigar are provided, all three will be synthesized based on either the r1_len
    or r2_len stored on the class as appropriate.

    When synthesizing, bases are always a random sequence of bases, quals are all the default
    base quality (supplied when constructing a SamBuilder) and the cigar is always a single M
    operator of the read length.

    Args:
        name: The name of the template. If None is given a unique name will be auto-generated.
        read_num: Either None, 1 for R1 or 2 for R2
        bases: The bases for the read. If None is given a random sequence is generated.
        quals: The list of qualities for the read. If None, the default base quality is used.
        chrom: The chromosome to which the read is mapped. Defaults to the unmapped value.
        start: The start position of the read. Defaults to the unmapped value.
        cigar: The cigar string for the read. Defaults to None for unmapped reads, otherwise all M.
        mapq: Mapping quality for the read. Defaults to self.mapping_quality if not given.
        strand: The strand for the read, either "+" or "-". Defaults to "+".
        secondary: If true the read will be flagged as secondary
        supplementary: If true the read will be flagged as supplementary
        attrs: An optional dictionary of SAM attributes to place on the read.

    Raises:
        ValueError: if strand field is not "+" or "-"
        ValueError: if read_num is not None, 1 or 2
        ValueError: if bases/quals/cigar are set in a way that is not self-consistent

    Returns:
        AlignedSegment: The record created
    """

    if strand not in ["+", "-"]:
        raise ValueError(f"Invalid value for strand1: {strand}")
    if read_num not in [None, 1, 2]:
        raise ValueError(f"Invalid value for read_num: {read_num}")

    name = name if name is not None else self._next_name()

    # Setup the read
    read_len = self.r1_len if read_num != 2 else self.r2_len
    rec = self._new_rec(name=name, chrom=chrom, start=start, mapq=mapq, attrs=attrs)
    self._set_flags(
        rec, read_num=read_num, strand=strand, secondary=secondary, supplementary=supplementary
    )
    self._set_length_dependent_fields(
        rec=rec, length=read_len, bases=bases, quals=quals, cigar=cigar
    )

    self._records.append(rec)
    return rec
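
As an illustrative sketch (assuming SamBuilder accepts an r1_len constructor argument, per the read-length behavior described above):

>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder(r1_len=50)
>>> rec = builder.add_single(chrom="chr1", start=1000)
>>> rec.cigarstring  # synthesized as a single M operator of the read length
'50M'
>>> rec.is_paired  # read_num=None produces an unpaired read
False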
default_rg staticmethod
default_rg() -> Dict[str, str]

Returns the default read group used by the SamBuilder, as a dictionary.

Source code in fgpyo/sam/builder.py
@staticmethod
def default_rg() -> Dict[str, str]:
    """Returns the default read group used by the SamBuilder, as a dictionary."""
    return {"ID": "1", "SM": "1_AAAAAA", "LB": "default", "PL": "ILLUMINA", "PU": "xxx.1"}
default_sd staticmethod
default_sd() -> List[Dict[str, Any]]

Generates the sequence dictionary that is used by default by SamBuilder.

Matches the names and lengths of the HG19 reference in use in production.

Returns:

Type Description
List[Dict[str, Any]]

A new copy of the sequence dictionary as a list of dictionaries, one per chromosome.

Source code in fgpyo/sam/builder.py
@staticmethod
def default_sd() -> List[Dict[str, Any]]:
    """Generates the sequence dictionary that is used by default by SamBuilder.

    Matches the names and lengths of the HG19 reference in use in production.

    Returns:
        A new copy of the sequence dictionary as a list of dictionaries, one per chromosome.
    """
    return [
        {"SN": "chr1", "LN": 249250621},
        {"SN": "chr2", "LN": 243199373},
        {"SN": "chr3", "LN": 198022430},
        {"SN": "chr4", "LN": 191154276},
        {"SN": "chr5", "LN": 180915260},
        {"SN": "chr6", "LN": 171115067},
        {"SN": "chr7", "LN": 159138663},
        {"SN": "chr8", "LN": 146364022},
        {"SN": "chr9", "LN": 141213431},
        {"SN": "chr10", "LN": 135534747},
        {"SN": "chr11", "LN": 135006516},
        {"SN": "chr12", "LN": 133851895},
        {"SN": "chr13", "LN": 115169878},
        {"SN": "chr14", "LN": 107349540},
        {"SN": "chr15", "LN": 102531392},
        {"SN": "chr16", "LN": 90354753},
        {"SN": "chr17", "LN": 81195210},
        {"SN": "chr18", "LN": 78077248},
        {"SN": "chr19", "LN": 59128983},
        {"SN": "chr20", "LN": 63025520},
        {"SN": "chr21", "LN": 48129895},
        {"SN": "chr22", "LN": 51304566},
        {"SN": "chrX", "LN": 155270560},
        {"SN": "chrY", "LN": 59373566},
        {"SN": "chrM", "LN": 16571},
    ]
rg
rg() -> Dict[str, Any]

Returns the single read group that is defined in the header.

Source code in fgpyo/sam/builder.py
def rg(self) -> Dict[str, Any]:
    """Returns the single read group that is defined in the header."""
    # The `RG` field contains a list of read group mappings
    # e.g. `[{"ID": "rg1", "PL": "ILLUMINA"}]`
    rgs = cast(List[Dict[str, Any]], self._header["RG"])
    assert len(rgs) == 1, "Header did not contain exactly one read group!"
    return rgs[0]
rg_id
rg_id() -> str

Returns the ID of the single read group that is defined in the header.

Source code in fgpyo/sam/builder.py
def rg_id(self) -> str:
    """Returns the ID of the single read group that is defined in the header."""
    # The read group mapping has mixed types of values (e.g. "PI" is numeric), but the "ID"
    # field is always a string.
    return cast(str, self.rg()["ID"])
to_path
to_path(path: Optional[Path] = None, index: bool = True, pred: Callable[[AlignedSegment], bool] = lambda r: True, tmp_file_type: Optional[SamFileType] = None) -> Path

Writes the accumulated records to a file, sorts & indexes it, and returns the Path. If a path is provided, it will be written to; otherwise a temporary file is created and returned.

If path is provided, tmp_file_type may not be provided; in this case, the file type (SAM/BAM/CRAM) is determined automatically from the file extension. See pysam for more details.

If path is not provided, the file type will default to BAM unless tmp_file_type is provided.

Parameters:

Name Type Description Default
path Optional[Path]

a path at which to write the file, otherwise a temp file is used.

None
index bool

if True and sort_order is Coordinate and output is a BAM/CRAM file, then an index is generated, otherwise not.

True
pred Callable[[AlignedSegment], bool]

optional predicate to specify which reads should be output

lambda r: True
tmp_file_type Optional[SamFileType]

the file type to output when a path is not provided (default is BAM)

None

Returns:

Name Type Description
Path Path

The path to the sorted (and possibly indexed) file.

Source code in fgpyo/sam/builder.py
def to_path(  # noqa: C901
    self,
    path: Optional[Path] = None,
    index: bool = True,
    pred: Callable[[AlignedSegment], bool] = lambda r: True,
    tmp_file_type: Optional[sam.SamFileType] = None,
) -> Path:
    """Write the accumulated records to a file, sorts & indexes it, and returns the Path.
    If a path is provided, it will be written to, otherwise a temporary file is created
    and returned.

    If `path` is provided, `tmp_file_type` may not be provided. In this case, the file type
    (SAM/BAM/CRAM) will be automatically determined by the file extension when a path
    is provided.  See `~pysam` for more details.

    If `path` is not provided, the file type will default to BAM unless `tmp_file_type` is
    provided.

    Args:
        path: a path at which to write the file, otherwise a temp file is used.
        index: if True and `sort_order` is `Coordinate` and output is a BAM/CRAM file, then
               an index is generated, otherwise not.
        pred: optional predicate to specify which reads should be output
        tmp_file_type: the file type to output when a path is not provided (default is BAM)

    Returns:
        Path: The path to the sorted (and possibly indexed) file.
    """
    if path is not None:
        # Get the file type if a path was given (in this case, a file type may not be
        # provided too)
        if tmp_file_type is not None:
            raise ValueError("Both `path` and `tmp_file_type` cannot be provided.")
        tmp_file_type = sam.SamFileType.from_path(path)
    elif tmp_file_type is None:
        # Use the provided file type
        tmp_file_type = sam.SamFileType.BAM

    # Get the extension, and create a path if none was given
    ext = tmp_file_type.extension
    if path is None:
        with NamedTemporaryFile(suffix=ext, delete=False) as fp:
            path = Path(fp.name)

    with NamedTemporaryFile(suffix=ext, delete=True) as fp:
        file_handle: IO
        if self._sort_order in {SamOrder.Unsorted, SamOrder.Unknown}:
            file_handle = path.open("w")
        else:
            file_handle = fp.file

        with sam.writer(file_handle, header=self._samheader, file_type=tmp_file_type) as writer:
            for rec in self._records:
                if pred(rec):
                    writer.write(rec)

        samtools_sort_args = ["-o", str(path), fp.name]

        file_handle.close()
        if self._sort_order == SamOrder.QueryName:
            pysam.sort("-n", *samtools_sort_args)
        elif self._sort_order == SamOrder.Coordinate:
            if index and tmp_file_type.indexable:
                samtools_sort_args.insert(0, "--write-index")
            pysam.sort(*samtools_sort_args)

    return path
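
A minimal sketch of writing the accumulated records to a temporary BAM:

>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()
>>> _ = builder.add_pair(chrom="chr1", start1=100, start2=200)
>>> path = builder.to_path()  # a sorted temporary BAM, indexed when coordinate sorted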
to_sorted_list
to_sorted_list() -> List[AlignedSegment]

Returns the accumulated records in coordinate order.

Source code in fgpyo/sam/builder.py
def to_sorted_list(self) -> List[pysam.AlignedSegment]:
    """Returns the accumulated records in coordinate order."""
    with NamedTemporaryFile(suffix=".bam", delete=True) as fp:
        filename = fp.name
        path = self.to_path(path=Path(filename), index=False)
        bam = sam.reader(path)
        return list(bam)
to_unsorted_list
to_unsorted_list() -> List[AlignedSegment]

Returns the accumulated records in the order they were created.

Source code in fgpyo/sam/builder.py
def to_unsorted_list(self) -> List[pysam.AlignedSegment]:
    """Returns the accumulated records in the order they were created."""
    return list(self._records)
clipping
Utility Functions for Soft-Clipping records in SAM/BAM Files

This module contains utility functions for soft-clipping reads. There are four variants that support clipping the beginnings and ends of reads, and specifying the amount to be clipped in terms of query bases or reference bases:

  • softclip_start_of_alignment_by_query
  • softclip_end_of_alignment_by_query
  • softclip_start_of_alignment_by_ref
  • softclip_end_of_alignment_by_ref

The difference between query and reference based versions is apparent only when there are insertions or deletions in the read as indels have lengths on either the query (insertions) or reference (deletions) but not both.

Upon clipping a set of additional SAM tags are removed from reads as they are likely invalid.

For example, to clip the last 10 query bases of all records and reduce the qualities to Q2:

>>> from fgpyo.sam import reader, clipping
>>> with reader("./tests/fgpyo/sam/data/valid.sam") as fh:
...     for rec in fh:
...         before = rec.cigarstring
...         info = clipping.softclip_end_of_alignment_by_query(rec, 10, 2)
...         after = rec.cigarstring
...         print(f"before: {before} after: {after} info: {info}")
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 10M1D10M5I76M after: 10M1D10M5I66M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: None after: None info: ClippingInfo(query_bases_clipped=0, ref_bases_clipped=0)

Note that any clipping potentially invalidates the common SAM tags NM, MD and UQ, as well as other alignment-based SAM tags. Any clipping added to the start of an alignment changes the position (reference_start) of the record. Any reads that have no aligned bases after clipping are set to be unmapped. If writing the clipped reads back to a BAM, note that:

  • Mate pairs may have incorrect information about their mate's positions
  • Even if the input was coordinate sorted, the output may be out of order

To rectify these problems it is necessary to do the equivalent of:

cat clipped.bam | samtools sort -n | samtools fixmate | samtools sort | samtools calmd
Classes
ClippingInfo

Bases: NamedTuple

Named tuple holding the number of bases clipped on the query and reference respectively.

Source code in fgpyo/sam/clipping.py
class ClippingInfo(NamedTuple):
    """Named tuple holding the number of bases clipped on the query and reference respectively."""

    query_bases_clipped: int
    """The number of query bases in the alignment that were clipped."""

    ref_bases_clipped: int
    """The number of reference bases in the alignment that were clipped."""
Attributes
query_bases_clipped instance-attribute
query_bases_clipped: int

The number of query bases in the alignment that were clipped.

ref_bases_clipped instance-attribute
ref_bases_clipped: int

The number of reference bases in the alignment that were clipped.

Functions
softclip_end_of_alignment_by_query
softclip_end_of_alignment_by_query(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo

Adds soft-clipping to the end of a read's alignment.

Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters:

Name Type Description Default
rec AlignedSegment

the BAM record to clip

required
bases_to_clip int

the number of additional bases of clipping desired in the read/query

required
clipped_base_quality Optional[int]

if not None, set bases in the clipped region to this quality

None
tags_to_invalidate Iterable[str]

the set of extended attributes to remove upon clipping

TAGS_TO_INVALIDATE

Returns:

Name Type Description
ClippingInfo ClippingInfo

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Source code in fgpyo/sam/clipping.py
def softclip_end_of_alignment_by_query(
    rec: AlignedSegment,
    bases_to_clip: int,
    clipped_base_quality: Optional[int] = None,
    tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE,
) -> ClippingInfo:
    """
    Adds soft-clipping to the end of a read's alignment.

    Clipping is applied before any existing hard or soft clipping.  E.g. a read with cigar 100M5S
    that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

    If the read is unmapped or bases_to_clip < 1 then nothing is done.

    If the read has fewer clippable bases than requested the read will be unmapped.

    Args:
        rec: the BAM record to clip
        bases_to_clip: the number of additional bases of clipping desired in the read/query
        clipped_base_quality: if not None, set bases in the clipped region to this quality
        tags_to_invalidate: the set of extended attributes to remove upon clipping

    Returns:
        ClippingInfo: a named tuple containing the number of query/read bases and the number
            of target/reference bases clipped.
    """
    if rec.is_unmapped or bases_to_clip < 1:
        return ClippingInfo(0, 0)

    num_clippable_bases = rec.query_alignment_length

    if bases_to_clip >= num_clippable_bases:
        return _clip_whole_read(rec, tags_to_invalidate)

    # Reverse the cigar and qualities so we can clip from the start
    cigar = Cigar.from_cigartuples(rec.cigartuples).reversed()
    quals = rec.query_qualities
    quals.reverse()
    new_cigar, clipping_info = _clip(cigar, quals, bases_to_clip, clipped_base_quality)

    # Then reverse everything back again
    quals.reverse()
    rec.query_qualities = quals
    rec.cigarstring = str(new_cigar.reversed())

    _cleanup(rec, tags_to_invalidate)
    return clipping_info
softclip_end_of_alignment_by_ref
softclip_end_of_alignment_by_ref(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo

Soft-clips the end of an alignment by bases_to_clip bases on the reference.

Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters:

Name Type Description Default
rec AlignedSegment

the BAM record to clip

required
bases_to_clip int

the number of additional bases of clipping desired on the reference

required
clipped_base_quality Optional[int]

if not None, set bases in the clipped region to this quality

None
tags_to_invalidate Iterable[str]

the set of extended attributes to remove upon clipping

TAGS_TO_INVALIDATE

Returns:

Name Type Description
ClippingInfo ClippingInfo

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Source code in fgpyo/sam/clipping.py
def softclip_end_of_alignment_by_ref(
    rec: AlignedSegment,
    bases_to_clip: int,
    clipped_base_quality: Optional[int] = None,
    tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE,
) -> ClippingInfo:
    """Soft-clips the end of an alignment by bases_to_clip bases on the reference.

    Clipping is applied before any existing hard or soft clipping.  E.g. a read with cigar 100M5S
    that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

    If the read is unmapped or bases_to_clip < 1 then nothing is done.

    If the read has fewer clippable bases than requested the read will be unmapped.

    Args:
        rec: the BAM record to clip
        bases_to_clip: the number of additional bases of clipping desired on the reference
        clipped_base_quality: if not None, set bases in the clipped region to this quality
        tags_to_invalidate: the set of extended attributes to remove upon clipping

    Returns:
        ClippingInfo: a named tuple containing the number of query/read bases and the number
            of target/reference bases clipped.
    """
    if rec.reference_length <= bases_to_clip:
        return _clip_whole_read(rec, tags_to_invalidate)

    new_end = rec.reference_end - bases_to_clip
    new_query_end = _read_pos_at_ref_pos(rec, new_end, previous=False)
    query_bases_to_clip = rec.query_alignment_end - new_query_end
    return softclip_end_of_alignment_by_query(
        rec, query_bases_to_clip, clipped_base_quality, tags_to_invalidate
    )
softclip_start_of_alignment_by_query
softclip_start_of_alignment_by_query(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo

Adds soft-clipping to the start of a read's alignment.

Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters:

Name Type Description Default
rec AlignedSegment

the BAM record to clip

required
bases_to_clip int

the number of additional bases of clipping desired in the read/query

required
clipped_base_quality Optional[int]

if not None, set bases in the clipped region to this quality

None
tags_to_invalidate Iterable[str]

the set of extended attributes to remove upon clipping

TAGS_TO_INVALIDATE

Returns:

Name Type Description
ClippingInfo ClippingInfo

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Source code in fgpyo/sam/clipping.py
def softclip_start_of_alignment_by_query(
    rec: AlignedSegment,
    bases_to_clip: int,
    clipped_base_quality: Optional[int] = None,
    tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE,
) -> ClippingInfo:
    """
    Adds soft-clipping to the start of a read's alignment.

    Clipping is applied after any existing hard or soft clipping.  E.g. a read with cigar 5S100M
    that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

    If the read is unmapped or bases_to_clip < 1 then nothing is done.

    If the read has fewer clippable bases than requested the read will be unmapped.

    Args:
        rec: the BAM record to clip
        bases_to_clip: the number of additional bases of clipping desired in the read/query
        clipped_base_quality: if not None, set bases in the clipped region to this quality
        tags_to_invalidate: the set of extended attributes to remove upon clipping

    Returns:
        ClippingInfo: a named tuple containing the number of query/read bases and the number
            of target/reference bases clipped.
    """
    if rec.is_unmapped or bases_to_clip < 1:
        return ClippingInfo(0, 0)

    num_clippable_bases = rec.query_alignment_length

    if bases_to_clip >= num_clippable_bases:
        return _clip_whole_read(rec, tags_to_invalidate)

    cigar = Cigar.from_cigartuples(rec.cigartuples)
    quals = rec.query_qualities
    new_cigar, clipping_info = _clip(cigar, quals, bases_to_clip, clipped_base_quality)
    rec.query_qualities = quals

    rec.reference_start += clipping_info.ref_bases_clipped
    rec.cigarstring = str(new_cigar)
    _cleanup(rec, tags_to_invalidate)
    return clipping_info
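
A small sketch showing the position shift when clipping the start of an alignment (the r1_len constructor argument on SamBuilder is assumed here):

>>> from fgpyo.sam import clipping
>>> from fgpyo.sam.builder import SamBuilder
>>> rec, _ = SamBuilder(r1_len=50).add_pair(chrom="chr1", start1=100, start2=300)
>>> clipping.softclip_start_of_alignment_by_query(rec, 10)
ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
>>> (rec.cigarstring, rec.reference_start)
('10S40M', 110)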
softclip_start_of_alignment_by_ref
softclip_start_of_alignment_by_ref(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo

Soft-clips the start of an alignment by bases_to_clip bases on the reference.

Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters:

Name Type Description Default
rec AlignedSegment

the BAM record to clip

required
bases_to_clip int

the number of additional bases of clipping desired on the reference

required
clipped_base_quality Optional[int]

if not None, set bases in the clipped region to this quality

None
tags_to_invalidate Iterable[str]

the set of extended attributes to remove upon clipping

TAGS_TO_INVALIDATE

Returns:

Name Type Description
ClippingInfo ClippingInfo

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Source code in fgpyo/sam/clipping.py
def softclip_start_of_alignment_by_ref(
    rec: AlignedSegment,
    bases_to_clip: int,
    clipped_base_quality: Optional[int] = None,
    tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE,
) -> ClippingInfo:
    """Soft-clips the start of an alignment by bases_to_clip bases on the reference.

    Clipping is applied after any existing hard or soft clipping.  E.g. a read with cigar 5S100M
    that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

    If the read is unmapped or bases_to_clip < 1 then nothing is done.

    If the read has fewer clippable bases than requested the read will be unmapped.

    Args:
        rec: the BAM record to clip
        bases_to_clip: the number of additional bases of clipping desired on the reference
        clipped_base_quality: if not None, set bases in the clipped region to this quality
        tags_to_invalidate: the set of extended attributes to remove upon clipping

    Returns:
        ClippingInfo: a named tuple containing the number of query/read bases and the number
            of target/reference bases clipped.
    """
    if rec.reference_length <= bases_to_clip:
        return _clip_whole_read(rec, tags_to_invalidate)

    new_start = rec.reference_start + bases_to_clip
    new_query_start = _read_pos_at_ref_pos(rec, new_start, previous=False)
    query_bases_to_clip = new_query_start - rec.query_alignment_start
    return softclip_start_of_alignment_by_query(
        rec, query_bases_to_clip, clipped_base_quality, tags_to_invalidate
    )

sequence

Utility Functions for Manipulating DNA and RNA sequences.

This module contains utility functions for manipulating DNA and RNA sequences.

The levenshtein and hamming functions are included for convenience. If you are performing many distance calculations, a C-based implementation is preferable, e.g. https://pypi.org/project/Distance/

Functions

complement
complement(base: str) -> str

Returns the complement of any base.

Source code in fgpyo/sequence.py
def complement(base: str) -> str:
    """Returns the complement of any base."""
    if len(base) != 1:
        raise ValueError(f"complement() may only be called with 1-character strings: {base}")
    else:
        return _COMPLEMENTS[base]
gc_content
gc_content(bases: str) -> float

Calculates the fraction of G and C bases in a sequence.

Source code in fgpyo/sequence.py
def gc_content(bases: str) -> float:
    """Calculates the fraction of G and C bases in a sequence."""
    if len(bases) == 0:
        return 0
    gc_count = sum(1 for base in bases if base in "CGcg")
    return gc_count / len(bases)
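
For example, the GC fraction of a mixed sequence, and the zero return for an empty string:

>>> from fgpyo.sequence import gc_content
>>> gc_content("ACGT")
0.5
>>> gc_content("")
0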
hamming
hamming(string1: str, string2: str) -> int

Calculates hamming distance between two strings, case sensitive. Strings must be of equal lengths.

Parameters:

Name Type Description Default
string1 str

first string for comparison

required
string2 str

second string for comparison

required

Raises:

Type Description
ValueError

If strings are of different lengths.

Source code in fgpyo/sequence.py
def hamming(string1: str, string2: str) -> int:
    """
    Calculates hamming distance between two strings, case sensitive.
    Strings must be of equal lengths.

    Args:
        string1: first string for comparison
        string2: second string for comparison

    Raises:
        ValueError: If strings are of different lengths.
    """
    if len(string1) != len(string2):
        raise ValueError(
            "Hamming distance requires two strings of equal lengths."
            f"Received {string1} and {string2}."
        )
    return sum(c1 != c2 for c1, c2 in zip(string1, string2))
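
For example, a single substitution yields a distance of one, and case differences count as mismatches:

>>> from fgpyo.sequence import hamming
>>> hamming("ACGT", "ACGA")
1
>>> hamming("acgt", "ACGT")
4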
levenshtein
levenshtein(string1: str, string2: str) -> int

Calculates levenshtein distance between two strings, case sensitive.

Parameters:

Name Type Description Default
string1 str

first string for comparison

required
string2 str

second string for comparison

required
Source code in fgpyo/sequence.py
def levenshtein(string1: str, string2: str) -> int:
    """
    Calculates levenshtein distance between two strings, case sensitive.

    Args:
        string1: first string for comparison
        string2: second string for comparison

    """
    N: int = len(string1)
    M: int = len(string2)
    if N == 0 or M == 0:
        return max(N, M)
    # Initialize N + 1 x M + 1 matrix with final row/column representing the empty string.
    # Fill in initial values for empty string sub-problem comparisons.
    #   A D C "
    # A - - - 3
    # B - - - 2
    # C - - - 1
    # " 3 2 1 0
    matrix: List[List[int]] = [[int()] * (M + 1) for _ in range(N + 1)]
    for j in range(M + 1):
        matrix[N][j] = M - j
    for i in range(N + 1):
        matrix[i][M] = N - i
    # Fill in matrix from bottom up using previous sub-problem solutions.
    #   A D C "      A D C "      A D C "      A D C "      A D C "
    # A - - - 3    A - - - 3    A - - 2 3    A - 2 2 3    A 1 2 2 3
    # B - - - 2 -> B - - 1 2 -> B - 1 1 2 -> B 2 1 1 2 -> B 2 1 1 2
    # C - - 0 1    C - 1 0 1    C 2 1 0 1    C 2 1 0 1    C 2 1 0 1
    # " 3 2 1 0    " 3 2 1 0    " 3 2 1 0    " 3 2 1 0    " 3 2 1 0
    for i in range(N - 1, -1, -1):
        for j in range(M - 1, -1, -1):
            if string1[i] == string2[j]:
                matrix[i][j] = matrix[i + 1][j + 1]  # No Operation
            else:
                matrix[i][j] = 1 + min(
                    matrix[i + 1][j],  # Deletion
                    matrix[i][j + 1],  # Insertion
                    matrix[i + 1][j + 1],  # Substitution
                )
    return matrix[0][0]
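
For example:

>>> from fgpyo.sequence import levenshtein
>>> levenshtein("kitten", "sitting")
3
>>> levenshtein("", "ACGT")
4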
longest_dinucleotide_run_length
longest_dinucleotide_run_length(bases: str) -> int

Number of bases in the longest dinucleotide run in a primer.

A dinucleotide run is when two nucleotides are repeated in tandem. For example, TTGG (length = 4) or AACCAACCAA (length = 10). If there are no such runs, returns 0.

Parameters:

Name Type Description Default
bases str

the bases over which to compute

required
Return

the number of bases in the longest dinuc repeat (NOT the number of repeat units)

Source code in fgpyo/sequence.py
def longest_dinucleotide_run_length(bases: str) -> int:
    """Number of bases in the longest dinucleotide run in a primer.

    A dinucleotide run is when two nucleotides are repeated in tandem. For example,
    TTGG (length = 4) or AACCAACCAA (length = 10). If there are no such runs, returns 0.

    Args:
        bases: the bases over which to compute

    Return:
        the number of bases in the longest dinuc repeat (NOT the number of repeat units)
    """
    return longest_multinucleotide_run_length(bases=bases, repeat_unit_length=2)
longest_homopolymer_length
longest_homopolymer_length(bases: str) -> int

Calculates the length of the longest homopolymer in the input sequence.

Parameters:

Name Type Description Default
bases str

the bases over which to compute

required
Return

the length of the longest homopolymer

Source code in fgpyo/sequence.py
def longest_homopolymer_length(bases: str) -> int:
    """Calculates the length of the longest homopolymer in the input sequence.

    Args:
        bases: the bases over which to compute

    Return:
        the length of the longest homopolymer
    """
    cur_length: int = 0
    i = 0
    # NB: if we have found a homopolymer of length `cur_length`, then we do not need
    # to examine the last `cur_length` bases since we'll never find a longer one.
    bases_len = len(bases)
    while i < bases_len - cur_length:
        base = bases[i].upper()
        j = i + 1
        while j < bases_len and bases[j].upper() == base:
            j += 1
        cur_length = max(cur_length, j - i)
        # skip over all the bases in the current homopolymer
        i = j
    return cur_length
longest_hp_length
longest_hp_length(bases: str) -> int

Calculates the length of the longest homopolymer in the input sequence.

Parameters:

Name Type Description Default
bases str

the bases over which to compute

required
Return

the length of the longest homopolymer

Source code in fgpyo/sequence.py
def longest_hp_length(bases: str) -> int:
    """Calculates the length of the longest homopolymer in the input sequence.

    Args:
        bases: the bases over which to compute

    Return:
        the length of the longest homopolymer
    """
    return longest_homopolymer_length(bases=bases)
longest_multinucleotide_run_length
longest_multinucleotide_run_length(bases: str, repeat_unit_length: int) -> int

Number of bases in the longest multi-nucleotide run.

A multi-nucleotide run is when N nucleotides are repeated in tandem. For example, TTGG (length = 4, N=2) or TAGTAGTAG (length = 9, N = 3). If there are no such runs, returns 0.

Parameters:

Name Type Description Default
bases str

the bases over which to compute

required
repeat_unit_length int

the length of the multi-nucleotide repetitive unit (must be > 0)

required

Returns:

Type Description
int

the number of bases in the longest multinucleotide repeat (NOT the number of repeat units)

Source code in fgpyo/sequence.py
def longest_multinucleotide_run_length(bases: str, repeat_unit_length: int) -> int:
    """Number of bases in the longest multi-nucleotide run.

    A multi-nucleotide run is when N nucleotides are repeated in tandem. For example,
    TTGG (length = 4, N=2) or TAGTAGTAG (length = 9, N = 3). If there are no such runs,
    returns 0.

    Args:
        bases: the bases over which to compute
        repeat_unit_length: the length of the multi-nucleotide repetitive unit (must be > 0)

    Returns:
        the number of bases in the longest multinucleotide repeat (NOT the number of repeat units)
    """
    if repeat_unit_length <= 0:
        raise ValueError(f"repeat_unit_length must be > 0, found: {repeat_unit_length}")
    elif len(bases) < repeat_unit_length:
        return 0
    elif len(bases) == repeat_unit_length:
        return repeat_unit_length
    elif repeat_unit_length == 1:
        return longest_homopolymer_length(bases=bases)

    best_length: int = 0
    start = 0  # the start index of the current multi-nucleotide run
    # NB: stop before the last base; a window starting there cannot contain a full
    # repeat unit.
    while start < len(bases) - 1:
        # get the dinuc bases
        dinuc = bases[start : start + repeat_unit_length].upper()
        # keep going while there are more di-nucs
        end = start + repeat_unit_length
        while end < len(bases) - 1 and dinuc == bases[end : end + repeat_unit_length].upper():
            end += repeat_unit_length
        cur_length = end - start
        # update the longest total run length
        best_length = max(best_length, cur_length)
        # move to the next start
        if cur_length <= repeat_unit_length:  # only one repeat unit found, move the start by 1bp
            start += 1
        else:  # multiple repeats found, skip to the last base of the current run
            start += cur_length - 1

    return best_length
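
For example, a tandem "TA" repeat of three units spans six bases, and a tandem "TAG" repeat of three units spans nine:

>>> from fgpyo.sequence import longest_multinucleotide_run_length
>>> longest_multinucleotide_run_length("GCTATATACG", repeat_unit_length=2)
6
>>> longest_multinucleotide_run_length("TAGTAGTAG", repeat_unit_length=3)
9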
reverse_complement
reverse_complement(bases: str) -> str

Reverse complements a base sequence.

Parameters:

Name Type Description Default
bases str

the bases to be reverse complemented.

required

Returns:

Type Description
str

the reverse complement of the provided base string

Source code in fgpyo/sequence.py
def reverse_complement(bases: str) -> str:
    """Reverse complements a base sequence.

    Arguments:
        bases: the bases to be reverse complemented.

    Returns:
        the reverse complement of the provided base string
    """
    rev_comp = bases.translate(_COMPLEMENTS_TABLE)[::-1]
    if len(rev_comp) != len(bases):
        # There were invalid characters that weren't translated.
        # Raise KeyError with all the invalid bases.
        bad_bases = "".join({base for base in bases if base not in _COMPLEMENTS})
        raise KeyError(f"Invalid bases found: {bad_bases}")
    return rev_comp
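
For example:

>>> from fgpyo.sequence import reverse_complement
>>> reverse_complement("ACCGT")
'ACGGT'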

util

Modules

inspect
Attributes
FieldType module-attribute
FieldType: TypeAlias = Union[Field, Attribute]

TypeAlias for dataclass Fields or attrs Attributes. It corresponds to the field type for the matching _DataclassesOrAttrClass.

Functions
attr_from
attr_from(cls: Type[_AttrFromType], kwargs: Dict[str, str], parsers: Optional[Dict[type, Callable[[str], Any]]] = None) -> _AttrFromType

Builds an attr or dataclasses class from keyword arguments

Parameters:

Name Type Description Default
cls Type[_AttrFromType]

the attr or dataclasses class to be built

required
kwargs Dict[str, str]

a dictionary of keyword arguments

required
parsers Optional[Dict[type, Callable[[str], Any]]]

a dictionary of parser functions to apply to specific types

None
Source code in fgpyo/util/inspect.py
def attr_from(
    cls: Type[_AttrFromType],
    kwargs: Dict[str, str],
    parsers: Optional[Dict[type, Callable[[str], Any]]] = None,
) -> _AttrFromType:
    """Builds an attr or dataclasses class from key-word arguments

    Args:
        cls: the attr or dataclasses class to be built
        kwargs: a dictionary of keyword arguments
        parsers: a dictionary of parser functions to apply to specific types

    """
    return_values: Dict[str, Any] = {}
    for attribute in get_fields(cls):  # type: ignore[arg-type]
        return_value: Any
        if attribute.name in kwargs:
            str_value: str = kwargs[attribute.name]
            set_value: bool = False

            # Use the converter if provided
            converter = getattr(attribute, "converter", None)
            if converter is not None:
                return_value = converter(str_value)
                set_value = True

            # try getting a known parser
            if not set_value:
                try:
                    parser = _get_parser(cls=cls, type_=attribute.type, parsers=parsers)
                    return_value = parser(str_value)
                    set_value = True
                except ParserNotFoundException:
                    pass

            # try setting by casting
            # Note that while bools *can* be cast from string, all non-empty strings evaluate to
            # True, because python, so we need to check for that explicitly
            if not set_value and attribute.type is not None and attribute.type is not bool:
                try:
                    return_value = attribute.type(str_value)  # type: ignore[operator]
                    set_value = True
                except (ValueError, TypeError):
                    pass

            # fail otherwise
            assert set_value, (
                f"Do not know how to convert string to {attribute.type} for value: {str_value}"
            )
        else:  # no value, check for a default
            assert attribute.default is not None or _attribute_is_optional(attribute), (
                f"No value given and no default for attribute `{attribute.name}`"
            )
            return_value = attribute.default
            # when the default is attr.NOTHING, just use None
            if return_value in MISSING:
                return_value = None

        return_values[attribute.name] = return_value

    return cls(**return_values)
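
A minimal sketch using a dataclass; the int fields here are parsed from their string form via the known parsers or the casting fallback shown above:

>>> from dataclasses import dataclass
>>> from fgpyo.util.inspect import attr_from
>>> @dataclass
... class Point:
...     x: int
...     y: int
>>> attr_from(cls=Point, kwargs={"x": "1", "y": "2"})
Point(x=1, y=2)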
dict_parser
dict_parser(cls: Type, type_: TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) -> partial

Returns a function that parses a stringified dict into a Dict of the correct type.

Parameters:

Name Type Description Default
cls Type

the type of the class object this is being parsed for (used to get default val for parsers)

required
type_ TypeAlias

the type of the attribute to be parsed

required
parsers Optional[Dict[type, Callable[[str], Any]]]

an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)

None
Source code in fgpyo/util/inspect.py
def dict_parser(
    cls: Type, type_: TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None
) -> partial:
    """
    Returns a function that parses a stringified dict into a `Dict` of the correct type.

    Args:
        cls: the type of the class object this is being parsed for (used to get default val for
            parsers)
        type_: the type of the attribute to be parsed
        parsers: an optional mapping from type to the function to use for parsing that type
            (allows for parsing of more complex types)
    """
    subtypes = typing.get_args(type_)
    assert len(subtypes) == 2, "Dict object must have exactly 2 subtypes per PEP specification!"
    (key_parser, val_parser) = (
        _get_parser(
            cls,
            subtypes[0],
            parsers,
        ),
        _get_parser(
            cls,
            subtypes[1],
            parsers,
        ),
    )

    def dict_parse(dict_string: str) -> Dict[Any, Any]:
        """
        Parses a dictionary value (can do so recursively)
        """
        assert dict_string[0] == "{", "Dict val improperly formatted"
        assert dict_string[-1] == "}", "Dict val improperly formatted"
        dict_string = dict_string[1:-1]
        if len(dict_string) == 0:
            return {}
        else:
            outer_splits = split_at_given_level(dict_string, split_delim=",")
            out_dict = {}
            for outer_split in outer_splits:
                inner_splits = split_at_given_level(outer_split, split_delim=";")
                assert len(inner_splits) % 2 == 0, (
                    "Inner splits of dict didn't have matched key val pairs"
                )
                for i in range(0, len(inner_splits), 2):
                    key = key_parser(inner_splits[i])
                    if key in out_dict:
                        raise ValueError("Duplicate key found in dict: {}".format(key))
                    out_dict[key] = val_parser(inner_splits[i + 1])
            return out_dict

    return functools.partial(dict_parse)
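
Note the stringified format implied by dict_parse above: the value is brace-wrapped, entries are comma-separated, and keys are separated from values by semicolons. A hedged sketch via attr_from, assuming the default parsers handle str and int:

>>> from dataclasses import dataclass
>>> from typing import Dict
>>> from fgpyo.util.inspect import attr_from
>>> @dataclass
... class Config:
...     counts: Dict[str, int]
>>> attr_from(cls=Config, kwargs={"counts": "{a;1,b;2}"})
Config(counts={'a': 1, 'b': 2})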
get_fields
get_fields(cls: Union[_DataclassesOrAttrClass, Type[_DataclassesOrAttrClass]]) -> Tuple[FieldType, ...]

Get the fields tuple from either a dataclasses or attr dataclass (or instance)

Source code in fgpyo/util/inspect.py
def get_fields(
    cls: Union[_DataclassesOrAttrClass, Type[_DataclassesOrAttrClass]],
) -> Tuple[FieldType, ...]:
    """Get the fields tuple from either a dataclasses or attr dataclass (or instance)"""
    if is_dataclasses_class(cls):
        return get_dataclasses_fields(cls)
    elif is_attr_class(cls):  # type: ignore[arg-type]
        return get_attr_fields(cls)  # type: ignore[arg-type, no-any-return]
    else:
        raise TypeError("cls must a dataclasses or attr class")
get_fields_dict
get_fields_dict(cls: Union[_DataclassesOrAttrClass, Type[_DataclassesOrAttrClass]]) -> Mapping[str, FieldType]

Get the fields dict from either a dataclasses or attr dataclass (or instance)

Source code in fgpyo/util/inspect.py
def get_fields_dict(
    cls: Union[_DataclassesOrAttrClass, Type[_DataclassesOrAttrClass]],
) -> Mapping[str, FieldType]:
    """Get the fields dict from either a dataclasses or attr dataclass (or instance)"""
    if is_dataclasses_class(cls):
        return _get_dataclasses_fields_dict(cls)
    elif is_attr_class(cls):  # type: ignore[arg-type]
        return get_attr_fields_dict(cls)  # type: ignore[arg-type]
    else:
        raise TypeError("cls must a dataclasses or attr class")
is_attr_class
is_attr_class(cls: type) -> bool

Return True if the class is an attr class, and False otherwise

Source code in fgpyo/util/inspect.py
def is_attr_class(cls: type) -> bool:
    """Return True if the class is an attr class, and False otherwise"""
    return hasattr(cls, "__attrs_attrs__")
list_parser
list_parser(cls: Type, type_: TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) -> partial

Returns a function that parses a "stringified" list into a List of the correct type.

Parameters:

Name Type Description Default
cls Type

the type of the class object this is being parsed for (used to get default val for parsers)

required
type_ TypeAlias

the type of the attribute to be parsed

required
parsers Optional[Dict[type, Callable[[str], Any]]]

an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)

None
Source code in fgpyo/util/inspect.py
def list_parser(
    cls: Type, type_: TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None
) -> partial:
    """
    Returns a function that parses a "stringified" list into a `List` of the correct type.

    Args:
        cls: the type of the class object this is being parsed for (used to get default val for
            parsers)
        type_: the type of the attribute to be parsed
        parsers: an optional mapping from type to the function to use for parsing that type (allows
            for parsing of more complex types)
    """
    subtypes = typing.get_args(type_)
    assert len(subtypes) == 1, "Lists are allowed only one subtype per PEP specification!"
    subtype_parser = _get_parser(
        cls,
        subtypes[0],
        parsers,
    )
    return functools.partial(
        lambda s: list(
            []
            if s == ""
            else [subtype_parser(item) for item in list(split_at_given_level(s, split_delim=","))]
        )
    )
set_parser
set_parser(cls: Type, type_: TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) -> partial

Returns a function that parses a stringified set into a Set of the correct type.

Parameters:

Name Type Description Default
cls Type

the type of the class object this is being parsed for (used to get default val for parsers)

required
type_ TypeAlias

the type of the attribute to be parsed

required
parsers Optional[Dict[type, Callable[[str], Any]]]

an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)

None
Source code in fgpyo/util/inspect.py
def set_parser(
    cls: Type, type_: TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None
) -> partial:
    """
    Returns a function that parses a stringified set into a `Set` of the correct type.

    Args:
        cls: the type of the class object this is being parsed for (used to get default val for
            parsers)
        type_: the type of the attribute to be parsed
        parsers: an optional mapping from type to the function to use for parsing that type (allows
            for parsing of more complex types)
    """
    subtypes = typing.get_args(type_)
    assert len(subtypes) == 1, "Sets are allowed only one subtype per PEP specification!"
    subtype_parser = _get_parser(
        cls,
        subtypes[0],
        parsers,
    )
    return functools.partial(
        lambda s: set(
            set({})
            if s == "{}"
            else [
                subtype_parser(item) for item in set(split_at_given_level(s[1:-1], split_delim=","))
            ]
        )
    )
split_at_given_level
split_at_given_level(field: str, split_delim: str = ',', increase_depth_chars: Iterable[str] = ('{', '(', '['), decrease_depth_chars: Iterable[str] = ('}', ')', ']')) -> List[str]

Splits a nested field by its outer-most level

Note that this method may produce incorrect results for fields containing strings with unpaired characters that increase or decrease the depth

Not currently smart enough to deal with fields enclosed in quotes ('' or "") - TODO

Source code in fgpyo/util/inspect.py
def split_at_given_level(
    field: str,
    split_delim: str = ",",
    increase_depth_chars: Iterable[str] = ("{", "(", "["),
    decrease_depth_chars: Iterable[str] = ("}", ")", "]"),
) -> List[str]:
    """
    Splits a nested field by its outer-most level

    Note that this method may produce incorrect results for fields containing strings with
    unpaired characters that increase or decrease the depth

    Not currently smart enough to deal with fields enclosed in quotes ('' or "") - TODO
    """

    outer_depth_of_split = 0
    current_outer_splits = []
    out_vals: List[str] = []
    for high_level_split in field.split(split_delim):
        increase_in_depth = 0
        for char in increase_depth_chars:
            increase_in_depth += high_level_split.count(char)

        decrease_in_depth = 0
        for char in decrease_depth_chars:
            decrease_in_depth += high_level_split.count(char)
        outer_depth_of_split += increase_in_depth - decrease_in_depth

        assert outer_depth_of_split >= 0, "Unpaired depth character! Likely incorrect output"

        current_outer_splits.append(high_level_split)
        if outer_depth_of_split == 0:
            out_vals.append(split_delim.join(current_outer_splits))
            current_outer_splits = []
    assert outer_depth_of_split == 0, "Unpaired depth character! Likely incorrect output!"
    return out_vals
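
For example, commas nested inside braces or brackets do not split:

>>> from fgpyo.util.inspect import split_at_given_level
>>> split_at_given_level("a,b,c")
['a', 'b', 'c']
>>> split_at_given_level("{a,b},c,[d,e]")
['{a,b}', 'c', '[d,e]']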
tuple_parser
tuple_parser(cls: Type, type_: TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) -> partial

Returns a function that parses a stringified tuple into a Tuple of the correct type.

Parameters:

Name Type Description Default
cls Type

the type of the class object this is being parsed for (used to get default val for parsers)

required
type_ TypeAlias

the type of the attribute to be parsed

required
parsers Optional[Dict[type, Callable[[str], Any]]]

an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)

None
Source code in fgpyo/util/inspect.py
def tuple_parser(
    cls: Type, type_: TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None
) -> partial:
    """
    Returns a function that parses a stringified tuple into a `Tuple` of the correct type.

    Args:
        cls: the type of the class object this is being parsed for (used to get default val for
            parsers)
        type_: the type of the attribute to be parsed
        parsers: an optional mapping from type to the function to use for parsing that type (allows
            for parsing of more complex types)
    """
    subtype_parsers = [
        _get_parser(
            cls,
            subtype,
            parsers,
        )
        for subtype in typing.get_args(type_)
    ]

    def tuple_parse(tuple_string: str) -> Tuple[Any, ...]:
        """
        Parses a dictionary value (can do so recursively)
        Note that this tool will fail on tuples containing strings containing
        unpaired '{', or '}' characters
        """
        assert tuple_string[0] == "(", "Tuple val improperly formatted"
        assert tuple_string[-1] == ")", "Tuple val improperly formatted"
        tuple_string = tuple_string[1:-1]
        if len(tuple_string) == 0:
            return ()
        else:
            val_strings = split_at_given_level(tuple_string, split_delim=",")
            return tuple(parser(val_str) for parser, val_str in zip(subtype_parsers, val_strings))

    return functools.partial(tuple_parse)
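As with sets, the returned parser is typically exercised indirectly, e.g. via Metric.parse() on a class with a Tuple attribute. A minimal sketch (Point is a hypothetical class):

>>> import dataclasses
>>> from typing import Tuple
>>> from fgpyo.util.metric import Metric
>>> @dataclasses.dataclass(frozen=True)
... class Point(Metric["Point"]):
...     coords: Tuple[int, int]
>>> Point.parse(fields=["(1,2)"])
Point(coords=(1, 2))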
Modules
logging
Methods for setting up logging for tools.
Progress Logging Examples

Input data (SAM/BAM/CRAM/VCF) are frequently iterated in genomic coordinate order. Progress logging is useful for reporting not only how many records have been consumed, but also their genomic coordinates. ProgressLogger() can log progress every fixed number of records, writing either to a logging.Logger or to a custom print method.

>>> from fgpyo.util.logging import ProgressLogger
>>> logged_lines = []
>>> progress = ProgressLogger(
...     printer=lambda s: logged_lines.append(s),
...     verb="recorded",
...     noun="items",
...     unit=2
... )
>>> progress.record(reference_name="chr1", position=1)  # does not log
False
>>> progress.record(reference_name="chr1", position=2)  # logs
True
>>> progress.record(reference_name="chr1", position=3)  # does not log
False
>>> progress.log_last()  # will log the last recorded item, if not previously logged
True
>>> logged_lines  # show the lines logged
['recorded 2 items: chr1:2', 'recorded 3 items: chr1:3']
Classes
ProgressLogger

Bases: AbstractContextManager

A little class to track progress.

This will output a log message every unit number times recorded.

Attributes:

Name Type Description
printer Callable[[str], Any]

either a Logger (in which case progress will be printed at Info) or a lambda that consumes a single string

noun str

the noun to use in the log message

verb str

the verb to use in the log message

unit int

the number of items for every log message

count int

the total count of items recorded

Source code in fgpyo/util/logging.py
class ProgressLogger(AbstractContextManager):
    """A little class to track progress.

    This will output a log message every `unit` number times recorded.

    Attributes:
        printer: either a Logger (in which case progress will be printed at Info) or a lambda
            that consumes a single string
        noun: the noun to use in the log message
        verb: the verb to use in the log message
        unit: the number of items for every log message
        count: the total count of items recorded
    """

    def __init__(
        self,
        printer: Union[Logger, Callable[[str], Any]],
        noun: str = "records",
        verb: str = "Read",
        unit: int = 100000,
    ) -> None:
        self.printer: Callable[[str], Any]
        if isinstance(printer, Logger):
            self.printer = lambda s: printer.info(s)
        else:
            self.printer = printer
        self.noun: str = noun
        self.verb: str = verb
        self.unit: int = unit
        self.count: int = 0
        self._count_mod_unit: int = 0
        self._last_reference_name: Optional[str] = None
        self._last_position: Optional[int] = None

    def __exit__(
        self, ex_type: Optional[Any], ex_value: Optional[Any], traceback: Optional[Any]
    ) -> Literal[False]:
        if ex_value is None:
            self.log_last()
        return False

    def record(
        self,
        reference_name: Optional[str] = None,
        position: Optional[int] = None,
    ) -> bool:
        """Record an item at a given genomic coordinate.
        Args:
            reference_name: the reference name of the item
            position: the 1-based start position of the item
        Returns:
            true if a message was logged, false otherwise
        """
        self.count += 1
        self._count_mod_unit += 1
        self._last_reference_name = reference_name
        self._last_position = None if position is None or position <= 0 else position
        if self._count_mod_unit == self.unit:
            self._count_mod_unit = 0
            self._log(refname=self._last_reference_name, position=self._last_position)
            return True
        else:
            return False

    def record_alignment(
        self,
        rec: AlignedSegment,
    ) -> bool:
        """Correctly record pysam.AlignedSegments (zero-based coordinates).

        Args:
            rec: pysam.AlignedSegment object

        Returns:
            true if a message was logged, false otherwise
        """
        if rec.reference_start is None:
            return self.record(None, None)
        else:
            return self.record(rec.reference_name, rec.reference_start + 1)

    def record_alignments(
        self,
        recs: Iterable[AlignedSegment],
    ) -> bool:
        """Correctly record multiple pysam.AlignedSegments (zero-based coordinates).

        Args:
            recs: pysam.AlignedSegment objects

        Returns:
            true if a message was logged, false otherwise
        """
        logged_message: bool = False
        for rec in recs:
            logged_message = self.record_alignment(rec) or logged_message
        return logged_message

    def _log(
        self,
        refname: Optional[str] = None,
        position: Optional[int] = None,
    ) -> None:
        """Helper method to print the log message.

        Args:
            refname: the name of the reference of the item
            position: the 1-based start position of the item

        Returns:
            None
        """
        coordinate: str
        if refname is None and position is None:
            coordinate = "NA"
        else:
            assert refname is not None and position is not None, f"{refname} {position}"
            coordinate = f"{refname}:{position:,d}"

        self.printer(f"{self.verb} {self.count:,d} {self.noun}: {coordinate}")

        return None

    def log_last(
        self,
    ) -> bool:
        """Force logging the last record, for example when progress has completed."""
        if self._count_mod_unit != 0:
            self._log(refname=self._last_reference_name, position=self._last_position)
            return True
        else:
            return False
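Because ProgressLogger is a context manager, log_last() is invoked automatically on a clean exit. A minimal sketch:

>>> from fgpyo.util.logging import ProgressLogger
>>> logged = []
>>> with ProgressLogger(printer=logged.append, verb="processed", noun="records", unit=3) as progress:
...     for position in range(1, 5):
...         _ = progress.record(reference_name="chr1", position=position)
>>> logged
['processed 3 records: chr1:3', 'processed 4 records: chr1:4']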
Functions
log_last
log_last() -> bool

Force logging the last record, for example when progress has completed.

Source code in fgpyo/util/logging.py
def log_last(
    self,
) -> bool:
    """Force logging the last record, for example when progress has completed."""
    if self._count_mod_unit != 0:
        self._log(refname=self._last_reference_name, position=self._last_position)
        return True
    else:
        return False
record
record(reference_name: Optional[str] = None, position: Optional[int] = None) -> bool

Record an item at a given genomic coordinate.

Parameters:

Name Type Description Default
reference_name Optional[str]

the reference name of the item

None
position Optional[int]

the 1-based start position of the item

None

Returns:

Type Description
bool

true if a message was logged, false otherwise

Source code in fgpyo/util/logging.py
def record(
    self,
    reference_name: Optional[str] = None,
    position: Optional[int] = None,
) -> bool:
    """Record an item at a given genomic coordinate.
    Args:
        reference_name: the reference name of the item
        position: the 1-based start position of the item
    Returns:
        true if a message was logged, false otherwise
    """
    self.count += 1
    self._count_mod_unit += 1
    self._last_reference_name = reference_name
    self._last_position = None if position is None or position <= 0 else position
    if self._count_mod_unit == self.unit:
        self._count_mod_unit = 0
        self._log(refname=self._last_reference_name, position=self._last_position)
        return True
    else:
        return False
record_alignment
record_alignment(rec: AlignedSegment) -> bool

Correctly record pysam.AlignedSegments (zero-based coordinates).

Parameters:

Name Type Description Default
rec AlignedSegment

pysam.AlignedSegment object

required

Returns:

Type Description
bool

true if a message was logged, false otherwise

Source code in fgpyo/util/logging.py
def record_alignment(
    self,
    rec: AlignedSegment,
) -> bool:
    """Correctly record pysam.AlignedSegments (zero-based coordinates).

    Args:
        rec: pysam.AlignedSegment object

    Returns:
        true if a message was logged, false otherwise
    """
    if rec.reference_start is None:
        return self.record(None, None)
    else:
        return self.record(rec.reference_name, rec.reference_start + 1)
record_alignments
record_alignments(recs: Iterable[AlignedSegment]) -> bool

Correctly record multiple pysam.AlignedSegments (zero-based coordinates).

Parameters:

Name Type Description Default
recs Iterable[AlignedSegment]

pysam.AlignedSegment objects

required

Returns:

Type Description
bool

true if a message was logged, false otherwise

Source code in fgpyo/util/logging.py
def record_alignments(
    self,
    recs: Iterable[AlignedSegment],
) -> bool:
    """Correctly record multiple pysam.AlignedSegments (zero-based coordinates).

    Args:
        recs: pysam.AlignedSegment objects

    Returns:
        true if a message was logged, false otherwise
    """
    logged_message: bool = False
    for rec in recs:
        logged_message = self.record_alignment(rec) or logged_message
    return logged_message
Functions
setup_logging
setup_logging(level: str = 'INFO', name: str = 'fgpyo') -> None

Globally configure logging for all modules

Configures logging to run at a specific level and output messages to stderr with useful information preceding the actual log message.

Parameters:

Name Type Description Default
level str

the default level for the logger

'INFO'
name str

the name of the logger

'fgpyo'
Source code in fgpyo/util/logging.py
def setup_logging(level: str = "INFO", name: str = "fgpyo") -> None:
    """Globally configure logging for all modules

    Configures logging to run at a specific level and output messages to stderr with
    useful information preceding the actual log message.

    Args:
        level: the default level for the logger
        name: the name of the logger
    """
    global __FGPYO_LOGGING_SETUP

    with __LOCK:
        if not __FGPYO_LOGGING_SETUP:
            format = (
                f"%(asctime)s {socket.gethostname()} %(name)s:%(funcName)s:%(lineno)s "
                + "[%(levelname)s]: %(message)s"
            )
            handler = logging.StreamHandler()
            handler.setLevel(level)
            handler.setFormatter(logging.Formatter(format))

            logger = logging.getLogger(name)
            logger.setLevel(level)
            logger.addHandler(handler)
        else:
            logging.getLogger(__name__).warning("Logging already initialized.")

        __FGPYO_LOGGING_SETUP = True
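A minimal usage sketch; log messages are written to stderr with the timestamp, hostname, logger name, and code location preceding the message:

>>> from fgpyo.util.logging import setup_logging
>>> import logging
>>> setup_logging(level="INFO")
>>> logging.getLogger("fgpyo").info("pipeline started")  # written to stderr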
metric
Metrics

Module for storing, reading, and writing metric-like tab-delimited information.

Metric files are tab-delimited, contain a header, and zero or more rows for metric values. This makes it easy for them to be read in languages like R. For example, a row per person, with columns for age, gender, and address.

The Metric() class makes it easy to read, write, and store one or more metrics of the same type, all while preserving the type of each value in a metric. It is an abstract base class decorated by @dataclass or @attr.s, with attributes storing one or more typed values. If using multiple layers of inheritance, keep in mind that it is not possible to mix these dataclass utilities; e.g., a dataclasses class derived from an attr class will not appropriately initialize the values of the attr superclass.

Examples

Defining a new metric class:

>>> from fgpyo.util.metric import Metric
>>> import dataclasses
>>> @dataclasses.dataclass(frozen=True)
... class Person(Metric["Person"]):
...     name: str
...     age: int

or using attr:

>>> from fgpyo.util.metric import Metric
>>> import attr
>>> from typing import Optional
>>> @attr.s(auto_attribs=True, frozen=True)
... class PersonAttr(Metric["PersonAttr"]):
...     name: str
...     age: int
...     address: Optional[str] = None

Getting the attributes for a metric class. These will be used for the header when reading and writing metric files.

>>> Person.header()
['name', 'age']

Getting the values from a metric class instance. The values are in the same order as the header.

>>> list(Person(name="Alice", age=47).values())
['Alice', 47]

Writing a list of metrics to a file:

>>> metrics = [
...     Person(name="Alice", age=47),
...     Person(name="Bob", age=24)
... ]
>>> from pathlib import Path
>>> Person.write(Path("/path/to/metrics.txt"), *metrics)  

Then the contents of the written metrics file:

$ column -t /path/to/metrics.txt
name   age
Alice  47
Bob    24

Reading the metrics file back in:

>>> list(Person.read(Path("/path/to/metrics.txt")))  
[Person(name='Alice', age=47), Person(name='Bob', age=24)]

Formatting and parsing the values for custom types is supported by overriding the _parsers() and format_value() methods.

>>> @dataclasses.dataclass(frozen=True)
... class Name:
...     first: str
...     last: str
...     @classmethod
...     def parse(cls, value: str) -> "Name":
...          fields = value.split(" ")
...          return Name(first=fields[0], last=fields[1])
>>> from typing import Dict, Callable, Any
>>> @dataclasses.dataclass(frozen=True)
... class PersonWithName(Metric["PersonWithName"]):
...     name: Name
...     age: int
...     @classmethod
...     def _parsers(cls) -> Dict[type, Callable[[str], Any]]:
...         return {Name: lambda value: Name.parse(value=value)}
...     @classmethod
...     def format_value(cls, value: Any) -> str:
...         if isinstance(value, Name):
...             return f"{value.first} {value.last}"
...         else:
...             return super().format_value(value=value)
>>> PersonWithName.parse(fields=["john doe", "42"])
PersonWithName(name=Name(first='john', last='doe'), age=42)
>>> PersonWithName(name=Name(first='john', last='doe'), age=42).formatted_values()
['john doe', '42']
Classes
Metric

Bases: ABC, Generic[MetricType]

Abstract base class for all metric-like tab-delimited files

Metric files are tab-delimited, contain a header, and zero or more rows for metric values. This makes it easy for them to be read in languages like R.

Subclasses of Metric() can support parsing and formatting custom types with _parsers() and format_value().

Source code in fgpyo/util/metric.py
class Metric(ABC, Generic[MetricType]):
    """Abstract base class for all metric-like tab-delimited files

    Metric files are tab-delimited, contain a header, and zero or more rows for metric values.  This
    makes it easy for them to be read in languages like `R`.

    Subclasses of [`Metric()`][fgpyo.util.metric.Metric] can support parsing and
    formatting custom types with `_parsers()` and
    [`format_value()`][fgpyo.util.metric.Metric.format_value].
    """

    @classmethod
    def keys(cls) -> Iterator[str]:
        """An iterator over field names in the same order as the header."""
        for field in inspect.get_fields(cls):  # type: ignore[arg-type]
            yield field.name

    def values(self) -> Iterator[Any]:
        """An iterator over attribute values in the same order as the header."""
        for field in inspect.get_fields(self.__class__):  # type: ignore[arg-type]
            yield getattr(self, field.name)

    def items(self) -> Iterator[Tuple[str, Any]]:
        """
        An iterator over field names and their corresponding values in the same order as the header.
        """
        for field in inspect.get_fields(self.__class__):  # type: ignore[arg-type]
            yield (field.name, getattr(self, field.name))

    def formatted_values(self) -> List[str]:
        """An iterator over formatted attribute values in the same order as the header."""
        return [self.format_value(value) for value in self.values()]

    def formatted_items(self) -> List[Tuple[str, str]]:
        """An iterator over formatted attribute values in the same order as the header."""
        return [(key, self.format_value(value)) for key, value in self.items()]

    @classmethod
    def _parsers(cls) -> Dict[type, Callable[[str], Any]]:
        """Mapping of type to a specific parser for that type.  The parser must accept a string
        as a single parameter and return a single value of the given type.  Sub-classes may
        override this method to support custom types."""
        return {}

    @classmethod
    def read(
        cls,
        path: Path,
        ignore_extra_fields: bool = True,
        strip_whitespace: bool = False,
        threads: Optional[int] = None,
    ) -> Iterator[Any]:
        """Reads in zero or more metrics from the given path.

        The metric file must contain a matching header.

        Columns that are not present in the file but are optional in the metric class will
        be assigned their default values.

        Args:
            path: the path to the metrics file.
            ignore_extra_fields: True to ignore any extra columns, False to raise an exception.
            strip_whitespace: True to strip leading and trailing whitespace from each field,
                               False to keep as-is.
            threads: the number of threads to use when decompressing gzip files
        """
        parsers = cls._parsers()
        with io.to_reader(path, threads=threads) as reader:
            header: List[str] = reader.readline().rstrip("\r\n").split("\t")
            # check the header
            class_fields = set(cls.header())
            file_fields = set(header)
            missing_from_class = file_fields.difference(class_fields)
            missing_from_file = class_fields.difference(file_fields)

            field_name_to_attribute = inspect.get_fields_dict(cls)  # type: ignore[arg-type]

            # ignore class fields that are missing from the file (via header) if they're optional
            # or have a default
            if len(missing_from_file) > 0:
                fields_with_defaults = [
                    field
                    for field in missing_from_file
                    if inspect._attribute_has_default(field_name_to_attribute[field])
                ]
                # remove optional class fields from the fields
                missing_from_file = missing_from_file.difference(fields_with_defaults)

            # raise an exception if there are non-optional class fields missing from the file
            if len(missing_from_file) > 0:
                raise ValueError(
                    f"In file: {path}, fields in file missing from class '{cls.__name__}': "
                    + ", ".join(missing_from_file)
                )

            # raise an exception if there are fields in the file that are not in the class, unless
            # they should be ignored.
            if not ignore_extra_fields and len(missing_from_class) > 0:
                raise ValueError(
                    f"In file: {path}, extra fields in file not found in class '{cls.__name__}': "
                    + ", ".join(missing_from_class)
                )

            # read the metric lines
            for lineno, line in enumerate(reader, 2):
                # parse the raw values
                values: List[str] = line.rstrip("\r\n").split("\t")
                if strip_whitespace:
                    values = [v.strip() for v in values]

                # raise an exception if there aren't the same number of values as the header
                if len(header) != len(values):
                    raise ValueError(
                        f"In file: {path}, expected {len(header)} columns, got {len(values)} on "
                        f"line {lineno}: {line}"
                    )

                # build the metric
                instance: Metric[MetricType] = inspect.attr_from(
                    cls=cls, kwargs=dict(zip(header, values)), parsers=parsers
                )
                yield instance

    @classmethod
    def parse(cls, fields: List[str]) -> Any:
        """Parses the string-representation of this metric.  One string per attribute should be
        given.

        """
        parsers = cls._parsers()
        header = cls.header()
        assert len(fields) == len(header)
        return inspect.attr_from(cls=cls, kwargs=dict(zip(header, fields)), parsers=parsers)

    @classmethod
    def write(cls, path: Path, *values: MetricType, threads: Optional[int] = None) -> None:
        """Writes zero or more metrics to the given path.

        The header will always be written.

        Args:
            path: Path to the output file.
            values: Zero or more metrics.
            threads: the number of threads to use when compressing gzip files

        """
        with MetricWriter[MetricType](path, metric_class=cls, threads=threads) as writer:
            writer.writeall(values)

    @classmethod
    def header(cls) -> List[str]:
        """The list of header values for the metric."""
        return [a.name for a in inspect.get_fields(cls)]  # type: ignore[arg-type]

    @classmethod
    def format_value(cls, value: Any) -> str:  # noqa: C901
        """The default method to format values of a given type.

        By default, this method will comma-delimit `list`, `tuple`, and `set` types, and apply
        `str` to all others.

        Dictionaries / mappings will have keys and values separated by semicolons, and key/value
        pairs delimited by commas.

        In addition, tuples will be flanked with '()', and sets and dictionaries with '{}'.

        Args:
            value: the value to format.
        """
        if issubclass(type(value), Enum):
            return cls.format_value(value.value)
        if isinstance(value, (tuple)):
            if len(value) == 0:
                return "()"
            else:
                return "(" + ",".join(cls.format_value(v) for v in value) + ")"
        if isinstance(value, (list)):
            if len(value) == 0:
                return ""
            else:
                return ",".join(cls.format_value(v) for v in value)
        if isinstance(value, (set)):
            if len(value) == 0:
                return ""
            else:
                return "{" + ",".join(cls.format_value(v) for v in value) + "}"

        elif isinstance(value, dict):
            if len(value) == 0:
                return "{}"
            else:
                return (
                    "{"
                    + ",".join(
                        f"{cls.format_value(k)};{cls.format_value(v)}" for k, v in value.items()
                    )
                    + "}"
                )
        elif isinstance(value, float):
            return f"{round(value, 5)}"
        elif value is None:
            return ""
        else:
            return f"{value}"

    @classmethod
    def to_list(cls, value: str) -> List[Any]:
        """Returns a list value split on comma delimeter."""
        return [] if value == "" else value.split(",")

    @staticmethod
    def fast_concat(*inputs: Path, output: Path) -> None:
        if len(inputs) == 0:
            raise ValueError("No inputs provided")

        headers = [next(io.read_lines(input_path)) for input_path in inputs]
        assert len(set(headers)) == 1, "Input headers do not match"
        io.write_lines(path=output, lines_to_write=set(headers))

        for input_path in inputs:
            io.write_lines(
                path=output, lines_to_write=list(io.read_lines(input_path))[1:], append=True
            )

    @staticmethod
    def _read_header(
        reader: TextIOWrapper,
        delimiter: str = "\t",
        comment_prefix: str = "#",
    ) -> MetricFileHeader:
        """
        Read the header from an open file.

        The first row after any commented or empty lines will be used as the fieldnames.

        Lines preceding the fieldnames will be returned in the `preamble`. Leading and trailing
        whitespace are removed and ignored.

        Args:
            reader: An open, readable file handle.
            delimiter: The delimiter character used to separate fields in the file.
            comment_prefix: The prefix for comment lines in the file.

        Returns:
            A `MetricFileHeader` containing the field names and any preceding lines.

        Note:
            If the file is empty or contains only comments or empty lines, the returned
            header will have empty `fieldnames`.
        """

        preamble: List[str] = []

        for line in reader:
            if line.strip().startswith(comment_prefix) or line.strip() == "":
                # Skip any commented or empty lines before the header
                preamble.append(line.strip())
            else:
                # The first line with any other content is assumed to be the header
                fieldnames = line.strip().split(delimiter)
                break
        else:
            # If the file was empty, kick back an empty header
            fieldnames = []

        return MetricFileHeader(preamble=preamble, fieldnames=fieldnames)
Functions
format_value classmethod
format_value(value: Any) -> str

The default method to format values of a given type.

By default, this method will comma-delimit list, tuple, and set types, and apply str to all others.

Dictionaries / mappings will have keys and values separated by semicolons, and key/value pairs delimited by commas.

In addition, tuples will be flanked with '()', and sets and dictionaries with '{}'.

Parameters:

Name Type Description Default
value Any

the value to format.

required
Source code in fgpyo/util/metric.py
@classmethod
def format_value(cls, value: Any) -> str:  # noqa: C901
    """The default method to format values of a given type.

    By default, this method will comma-delimit `list`, `tuple`, and `set` types, and apply
    `str` to all others.

    Dictionaries / mappings will have keys and values separated by semicolons, and key/value
    pairs delimited by commas.

    In addition, tuples will be flanked with '()', and sets and dictionaries with '{}'.

    Args:
        value: the value to format.
    """
    if issubclass(type(value), Enum):
        return cls.format_value(value.value)
    if isinstance(value, (tuple)):
        if len(value) == 0:
            return "()"
        else:
            return "(" + ",".join(cls.format_value(v) for v in value) + ")"
    if isinstance(value, (list)):
        if len(value) == 0:
            return ""
        else:
            return ",".join(cls.format_value(v) for v in value)
    if isinstance(value, (set)):
        if len(value) == 0:
            return ""
        else:
            return "{" + ",".join(cls.format_value(v) for v in value) + "}"

    elif isinstance(value, dict):
        if len(value) == 0:
            return "{}"
        else:
            return (
                "{"
                + ",".join(
                    f"{cls.format_value(k)};{cls.format_value(v)}" for k, v in value.items()
                )
                + "}"
            )
    elif isinstance(value, float):
        return f"{round(value, 5)}"
    elif value is None:
        return ""
    else:
        return f"{value}"
formatted_items
formatted_items() -> List[Tuple[str, str]]

An iterator over formatted attribute values in the same order as the header.

Source code in fgpyo/util/metric.py
def formatted_items(self) -> List[Tuple[str, str]]:
    """An iterator over formatted attribute values in the same order as the header."""
    return [(key, self.format_value(value)) for key, value in self.items()]
formatted_values
formatted_values() -> List[str]

An iterator over formatted attribute values in the same order as the header.

Source code in fgpyo/util/metric.py
def formatted_values(self) -> List[str]:
    """An iterator over formatted attribute values in the same order as the header."""
    return [self.format_value(value) for value in self.values()]
header classmethod
header() -> List[str]

The list of header values for the metric.

Source code in fgpyo/util/metric.py
@classmethod
def header(cls) -> List[str]:
    """The list of header values for the metric."""
    return [a.name for a in inspect.get_fields(cls)]  # type: ignore[arg-type]
items
items() -> Iterator[Tuple[str, Any]]

An iterator over field names and their corresponding values in the same order as the header.

Source code in fgpyo/util/metric.py
def items(self) -> Iterator[Tuple[str, Any]]:
    """
    An iterator over field names and their corresponding values in the same order as the header.
    """
    for field in inspect.get_fields(self.__class__):  # type: ignore[arg-type]
        yield (field.name, getattr(self, field.name))
keys classmethod
keys() -> Iterator[str]

An iterator over field names in the same order as the header.

Source code in fgpyo/util/metric.py
@classmethod
def keys(cls) -> Iterator[str]:
    """An iterator over field names in the same order as the header."""
    for field in inspect.get_fields(cls):  # type: ignore[arg-type]
        yield field.name
parse classmethod
parse(fields: List[str]) -> Any

Parses the string-representation of this metric. One string per attribute should be given.

Source code in fgpyo/util/metric.py
@classmethod
def parse(cls, fields: List[str]) -> Any:
    """Parses the string-representation of this metric.  One string per attribute should be
    given.

    """
    parsers = cls._parsers()
    header = cls.header()
    assert len(fields) == len(header)
    return inspect.attr_from(cls=cls, kwargs=dict(zip(header, fields)), parsers=parsers)
read classmethod
read(path: Path, ignore_extra_fields: bool = True, strip_whitespace: bool = False, threads: Optional[int] = None) -> Iterator[Any]

Reads in zero or more metrics from the given path.

The metric file must contain a matching header.

Columns that are not present in the file but are optional in the metric class will be assigned their default values.

Parameters:

Name Type Description Default
path Path

the path to the metrics file.

required
ignore_extra_fields bool

True to ignore any extra columns, False to raise an exception.

True
strip_whitespace bool

True to strip leading and trailing whitespace from each field, False to keep as-is.

False
threads Optional[int]

the number of threads to use when decompressing gzip files

None
Source code in fgpyo/util/metric.py
@classmethod
def read(
    cls,
    path: Path,
    ignore_extra_fields: bool = True,
    strip_whitespace: bool = False,
    threads: Optional[int] = None,
) -> Iterator[Any]:
    """Reads in zero or more metrics from the given path.

    The metric file must contain a matching header.

    Columns that are not present in the file but are optional in the metric class will
    be assigned their default values.

    Args:
        path: the path to the metrics file.
        ignore_extra_fields: True to ignore any extra columns, False to raise an exception.
        strip_whitespace: True to strip leading and trailing whitespace from each field,
                           False to keep as-is.
        threads: the number of threads to use when decompressing gzip files
    """
    parsers = cls._parsers()
    with io.to_reader(path, threads=threads) as reader:
        header: List[str] = reader.readline().rstrip("\r\n").split("\t")
        # check the header
        class_fields = set(cls.header())
        file_fields = set(header)
        missing_from_class = file_fields.difference(class_fields)
        missing_from_file = class_fields.difference(file_fields)

        field_name_to_attribute = inspect.get_fields_dict(cls)  # type: ignore[arg-type]

        # ignore class fields that are missing from the file (via header) if they're optional
        # or have a default
        if len(missing_from_file) > 0:
            fields_with_defaults = [
                field
                for field in missing_from_file
                if inspect._attribute_has_default(field_name_to_attribute[field])
            ]
            # remove optional class fields from the fields
            missing_from_file = missing_from_file.difference(fields_with_defaults)

        # raise an exception if there are non-optional class fields missing from the file
        if len(missing_from_file) > 0:
            raise ValueError(
                f"In file: {path}, fields in file missing from class '{cls.__name__}': "
                + ", ".join(missing_from_file)
            )

        # raise an exception if there are fields in the file that are not in the class, unless
        # they should be ignored.
        if not ignore_extra_fields and len(missing_from_class) > 0:
            raise ValueError(
                f"In file: {path}, extra fields in file not found in class '{cls.__name__}': "
                + ", ".join(missing_from_class)
            )

        # read the metric lines
        for lineno, line in enumerate(reader, 2):
            # parse the raw values
            values: List[str] = line.rstrip("\r\n").split("\t")
            if strip_whitespace:
                values = [v.strip() for v in values]

            # raise an exception if there aren't the same number of values as the header
            if len(header) != len(values):
                raise ValueError(
                    f"In file: {path}, expected {len(header)} columns, got {len(values)} on "
                    f"line {lineno}: {line}"
                )

            # build the metric
            instance: Metric[MetricType] = inspect.attr_from(
                cls=cls, kwargs=dict(zip(header, values)), parsers=parsers
            )
            yield instance
to_list classmethod
to_list(value: str) -> List[Any]

Returns a list value split on the comma delimiter.

Source code in fgpyo/util/metric.py
@classmethod
def to_list(cls, value: str) -> List[Any]:
    """Returns a list value split on comma delimeter."""
    return [] if value == "" else value.split(",")
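For example:

>>> from fgpyo.util.metric import Metric
>>> Metric.to_list("a,b,c")
['a', 'b', 'c']
>>> Metric.to_list("")
[]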
values
values() -> Iterator[Any]

An iterator over attribute values in the same order as the header.

Source code in fgpyo/util/metric.py
def values(self) -> Iterator[Any]:
    """An iterator over attribute values in the same order as the header."""
    for field in inspect.get_fields(self.__class__):  # type: ignore[arg-type]
        yield getattr(self, field.name)
write classmethod
write(path: Path, *values: MetricType, threads: Optional[int] = None) -> None

Writes zero or more metrics to the given path.

The header will always be written.

Parameters:

Name Type Description Default
path Path

Path to the output file.

required
values MetricType

Zero or more metrics.

()
threads Optional[int]

the number of threads to use when compressing gzip files

None
Source code in fgpyo/util/metric.py
@classmethod
def write(cls, path: Path, *values: MetricType, threads: Optional[int] = None) -> None:
    """Writes zero or more metrics to the given path.

    The header will always be written.

    Args:
        path: Path to the output file.
        values: Zero or more metrics.
        threads: the number of threads to use when compressing gzip files

    """
    with MetricWriter[MetricType](path, metric_class=cls, threads=threads) as writer:
        writer.writeall(values)
MetricFileHeader dataclass

Header of a file.

A file's header contains an optional preamble, consisting of lines prefixed by a comment character and/or empty lines, and a required row of fieldnames before the data rows begin.

Attributes:

Name Type Description
preamble List[str]

A list of any lines preceding the fieldnames.

fieldnames List[str]

The field names specified in the final line of the header.

Source code in fgpyo/util/metric.py
@dataclass(frozen=True)
class MetricFileHeader:
    """
    Header of a file.

    A file's header contains an optional preamble, consisting of lines prefixed by a comment
    character and/or empty lines, and a required row of fieldnames before the data rows begin.

    Attributes:
        preamble: A list of any lines preceding the fieldnames.
        fieldnames: The field names specified in the final line of the header.
    """

    preamble: List[str]
    fieldnames: List[str]
MetricWriter

Bases: Generic[MetricType], AbstractContextManager

Source code in fgpyo/util/metric.py
class MetricWriter(Generic[MetricType], AbstractContextManager):
    _metric_class: Type[Metric]
    _fieldnames: List[str]
    _fout: TextIOWrapper
    _writer: DictWriter

    def __init__(
        self,
        filename: Union[Path, str],
        metric_class: Type[Metric],
        append: bool = False,
        delimiter: str = "\t",
        include_fields: Optional[List[str]] = None,
        exclude_fields: Optional[List[str]] = None,
        lineterminator: str = "\n",
        threads: Optional[int] = None,
    ) -> None:
        """
        Args:
            filename: Path to the file to write.
            metric_class: Metric class.
            append: If `True`, the file will be appended to. Otherwise, the specified file will be
                overwritten.
            delimiter: The output file delimiter.
            include_fields: If specified, only the listed fieldnames will be included when writing
                records to file. Fields will be written in the order provided.
                May not be used together with `exclude_fields`.
            exclude_fields: If specified, any listed fieldnames will be excluded when writing
                records to file.
                May not be used together with `include_fields`.
            lineterminator: The string used to terminate lines produced by the MetricWriter.
                Default = "\n".
            threads: the number of threads to use when compressing gzip files

        Raises:
            TypeError: If the provided metric class is not a dataclass- or attr-decorated
                subclass of `Metric`.
            AssertionError: If the provided filepath is not writable.
            AssertionError: If `append=True` and the provided file is not readable. (When appending,
                we check to ensure that the header matches the specified metric class. The file must
                be readable to get the header.)
            ValueError: If `append=True` and the provided file is a FIFO (named pipe).
            ValueError: If `append=True` and the provided file does not include a header.
            ValueError: If `append=True` and the header of the provided file does not match the
                specified metric class and the specified include/exclude fields.
        """

        filepath: Path = Path(filename)
        if (filepath.is_fifo() or filepath.is_char_device()) and append:
            raise ValueError("Cannot append to stdout, stderr, or other named pipe or stream")

        ordered_fieldnames: List[str] = _validate_and_generate_final_output_fieldnames(
            metric_class=metric_class,
            include_fields=include_fields,
            exclude_fields=exclude_fields,
        )

        _assert_is_metric_class(metric_class)
        io.assert_path_is_writable(filepath)
        if append:
            io.assert_path_is_readable(filepath)
            _assert_file_header_matches_metric(
                path=filepath,
                metric_class=metric_class,
                ordered_fieldnames=ordered_fieldnames,
                delimiter=delimiter,
            )

        self._metric_class = metric_class
        self._fieldnames = ordered_fieldnames
        self._fout = io.to_writer(filepath, append=append, threads=threads)
        self._writer = DictWriter(
            f=self._fout,
            fieldnames=self._fieldnames,
            delimiter=delimiter,
            lineterminator=lineterminator,
        )

        # If we aren't appending to an existing file, write the header before any rows
        if not append:
            self._writer.writeheader()

    def __enter__(self) -> "MetricWriter":
        return self

    def __exit__(
        self,
        exc_type: Type[BaseException],
        exc_value: BaseException,
        traceback: TracebackType,
    ) -> None:
        self.close()
        super().__exit__(exc_type, exc_value, traceback)

    def close(self) -> None:
        """Close the underlying file handle."""
        self._fout.close()

    def write(self, metric: MetricType) -> None:
        """
        Write a single Metric instance to file.

        The Metric is converted to a dictionary and then written using the underlying
        `csv.DictWriter`. If the `MetricWriter` was created using the `include_fields` or
        `exclude_fields` arguments, the fields of the Metric are subset and/or reordered
        accordingly before writing.

        Args:
            metric: An instance of the specified Metric.

        Raises:
            TypeError: If the provided `metric` is not an instance of the Metric class used to
                parametrize the writer.
        """

        # Serialize the Metric to a dict for writing by the underlying `DictWriter`
        row = {fieldname: val for fieldname, val in metric.formatted_items()}

        # Filter and/or re-order output fields if necessary
        row = {fieldname: row[fieldname] for fieldname in self._fieldnames}

        self._writer.writerow(row)

    def writeall(self, metrics: Iterable[MetricType]) -> None:
        """
        Write multiple Metric instances to file.

        Each Metric is converted to a dictionary and then written using the underlying
        `csv.DictWriter`. If the `MetricWriter` was created using the `include_fields` or
        `exclude_fields` arguments, the attributes of each Metric are subset and/or reordered
        accordingly before writing.

        Args:
            metrics: A sequence of instances of the specified Metric.
        """
        for metric in metrics:
            self.write(metric)
Functions
__init__
__init__(filename: Union[Path, str], metric_class: Type[Metric], append: bool = False, delimiter: str = '\t', include_fields: Optional[List[str]] = None, exclude_fields: Optional[List[str]] = None, lineterminator: str = '\n', threads: Optional[int] = None) -> None
    Args:
        filename: Path to the file to write.
        metric_class: Metric class.
        append: If `True`, the file will be appended to. Otherwise, the specified file will be
            overwritten.
        delimiter: The output file delimiter.
        include_fields: If specified, only the listed fieldnames will be included when writing
            records to file. Fields will be written in the order provided.
            May not be used together with `exclude_fields`.
        exclude_fields: If specified, any listed fieldnames will be excluded when writing
            records to file.
            May not be used together with `include_fields`.
        lineterminator: The string used to terminate lines produced by the MetricWriter.
            Default = "

". threads: the number of threads to use when compressing gzip files

    Raises:
        TypeError: If the provided metric class is not a dataclass- or attr-decorated
            subclass of `Metric`.
        AssertionError: If the provided filepath is not writable.
        AssertionError: If `append=True` and the provided file is not readable. (When appending,
            we check to ensure that the header matches the specified metric class. The file must
            be readable to get the header.)
        ValueError: If `append=True` and the provided file is a FIFO (named pipe).
        ValueError: If `append=True` and the provided file does not include a header.
        ValueError: If `append=True` and the header of the provided file does not match the
            specified metric class and the specified include/exclude fields.
Source code in fgpyo/util/metric.py
def __init__(
    self,
    filename: Union[Path, str],
    metric_class: Type[Metric],
    append: bool = False,
    delimiter: str = "\t",
    include_fields: Optional[List[str]] = None,
    exclude_fields: Optional[List[str]] = None,
    lineterminator: str = "\n",
    threads: Optional[int] = None,
) -> None:
    """
    Args:
        filename: Path to the file to write.
        metric_class: Metric class.
        append: If `True`, the file will be appended to. Otherwise, the specified file will be
            overwritten.
        delimiter: The output file delimiter.
        include_fields: If specified, only the listed fieldnames will be included when writing
            records to file. Fields will be written in the order provided.
            May not be used together with `exclude_fields`.
        exclude_fields: If specified, any listed fieldnames will be excluded when writing
            records to file.
            May not be used together with `include_fields`.
        lineterminator: The string used to terminate lines produced by the MetricWriter.
            Default = "\n".
        threads: the number of threads to use when compressing gzip files

    Raises:
        TypeError: If the provided metric class is not a dataclass- or attr-decorated
            subclass of `Metric`.
        AssertionError: If the provided filepath is not writable.
        AssertionError: If `append=True` and the provided file is not readable. (When appending,
            we check to ensure that the header matches the specified metric class. The file must
            be readable to get the header.)
        ValueError: If `append=True` and the provided file is a FIFO (named pipe).
        ValueError: If `append=True` and the provided file does not include a header.
        ValueError: If `append=True` and the header of the provided file does not match the
            specified metric class and the specified include/exclude fields.
    """

    filepath: Path = Path(filename)
    if (filepath.is_fifo() or filepath.is_char_device()) and append:
        raise ValueError("Cannot append to stdout, stderr, or other named pipe or stream")

    ordered_fieldnames: List[str] = _validate_and_generate_final_output_fieldnames(
        metric_class=metric_class,
        include_fields=include_fields,
        exclude_fields=exclude_fields,
    )

    _assert_is_metric_class(metric_class)
    io.assert_path_is_writable(filepath)
    if append:
        io.assert_path_is_readable(filepath)
        _assert_file_header_matches_metric(
            path=filepath,
            metric_class=metric_class,
            ordered_fieldnames=ordered_fieldnames,
            delimiter=delimiter,
        )

    self._metric_class = metric_class
    self._fieldnames = ordered_fieldnames
    self._fout = io.to_writer(filepath, append=append, threads=threads)
    self._writer = DictWriter(
        f=self._fout,
        fieldnames=self._fieldnames,
        delimiter=delimiter,
        lineterminator=lineterminator,
    )

    # If we aren't appending to an existing file, write the header before any rows
    if not append:
        self._writer.writeheader()
close
close() -> None

Close the underlying file handle.

Source code in fgpyo/util/metric.py
def close(self) -> None:
    """Close the underlying file handle."""
    self._fout.close()
write
write(metric: MetricType) -> None

Write a single Metric instance to file.

The Metric is converted to a dictionary and then written using the underlying csv.DictWriter. If the MetricWriter was created using the include_fields or exclude_fields arguments, the fields of the Metric are subset and/or reordered accordingly before writing.

Parameters:

Name Type Description Default
metric MetricType

An instance of the specified Metric.

required

Raises:

Type Description
TypeError

If the provided metric is not an instance of the Metric class used to parametrize the writer.

Source code in fgpyo/util/metric.py
def write(self, metric: MetricType) -> None:
    """
    Write a single Metric instance to file.

    The Metric is converted to a dictionary and then written using the underlying
    `csv.DictWriter`. If the `MetricWriter` was created using the `include_fields` or
    `exclude_fields` arguments, the fields of the Metric are subset and/or reordered
    accordingly before writing.

    Args:
        metric: An instance of the specified Metric.

    Raises:
        TypeError: If the provided `metric` is not an instance of the Metric class used to
            parametrize the writer.
    """

    # Serialize the Metric to a dict for writing by the underlying `DictWriter`
    row = {fieldname: val for fieldname, val in metric.formatted_items()}

    # Filter and/or re-order output fields if necessary
    row = {fieldname: row[fieldname] for fieldname in self._fieldnames}

    self._writer.writerow(row)
writeall
writeall(metrics: Iterable[MetricType]) -> None

Write multiple Metric instances to file.

Each Metric is converted to a dictionary and then written using the underlying csv.DictWriter. If the MetricWriter was created using the include_fields or exclude_fields arguments, the attributes of each Metric are subset and/or reordered accordingly before writing.

Parameters:

Name Type Description Default
metrics Iterable[MetricType]

A sequence of instances of the specified Metric.

required
Source code in fgpyo/util/metric.py
def writeall(self, metrics: Iterable[MetricType]) -> None:
    """
    Write multiple Metric instances to file.

    Each Metric is converted to a dictionary and then written using the underlying
    `csv.DictWriter`. If the `MetricWriter` was created using the `include_fields` or
    `exclude_fields` arguments, the attributes of each Metric are subset and/or reordered
    accordingly before writing.

    Args:
        metrics: A sequence of instances of the specified Metric.
    """
    for metric in metrics:
        self.write(metric)
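A minimal usage sketch, reusing the Person metric defined in the module examples above (the output path is hypothetical):

>>> from pathlib import Path
>>> from fgpyo.util.metric import MetricWriter
>>> with MetricWriter(Path("/path/to/metrics.txt"), metric_class=Person, include_fields=["name"]) as writer:
...     writer.writeall([Person(name="Alice", age=47), Person(name="Bob", age=24)])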
Modules
string
Functions
column_it
column_it(rows: List[List[str]], delimiter: str = ' ') -> str

A simple version of Unix's column utility. This assumes the table is NxM.

Parameters:

Name Type Description Default
rows List[List[str]]

the rows to adjust. Each row must have the same number of delimited fields.

required
delimiter str

the delimiter for each field in a row.

' '
Source code in fgpyo/util/string.py
def column_it(rows: List[List[str]], delimiter: str = " ") -> str:
    """A simple version of Unix's `column` utility.  This assumes the table is NxM.

    Args:
        rows: the rows to adjust.  Each row must have the same number of delimited fields.
        delimiter: the delimiter for each field in a row.
    """
    # get the # of columns
    num_columns = len(rows[0])
    # for each column, find the maximum length of a cell
    max_column_lengths: List[int] = [
        max(len(row[col_i]) for row in rows) for col_i in range(num_columns)
    ]
    # pad each row in the table
    return "\n".join(
        delimiter.join(
            (" " * (max_column_lengths[col_i] - len(row[col_i]))) + row[col_i]
            for col_i in range(num_columns)
        )
        for row in rows
    )
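For example, each cell is right-justified to the widest cell in its column:

>>> from fgpyo.util.string import column_it
>>> print(column_it([["name", "age"], ["Alice", "47"]]))
 name age
Alice  47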
types
Attributes
TypeAnnotation module-attribute
TypeAnnotation: TypeAlias = Union[type, _GenericAlias, UnionType, GenericAlias]

A function parameter's type annotation may be any of the following:

1. type, when declaring any of the built-in Python types.
2. typing._GenericAlias, when declaring generic collection types or union types using pre-PEP 585 and pre-PEP 604 syntax (e.g. List[int], Optional[int], or Union[int, None]).
3. types.UnionType, when declaring union types using PEP 604 syntax (e.g. int | None).
4. types.GenericAlias, when declaring generic collection types using PEP 585 syntax (e.g. list[int]).

types.GenericAlias is a subclass of type, but typing._GenericAlias and types.UnionType are not and must be considered explicitly.

Functions
is_constructible_from_str
is_constructible_from_str(type_: type) -> bool

Returns true if the provided type can be constructed from a string

Source code in fgpyo/util/types.py
def is_constructible_from_str(type_: type) -> bool:
    """Returns true if the provided type can be constructed from a string"""
    try:
        sig = inspect.signature(type_)
        ((argname, _),) = sig.bind(object()).arguments.items()
    except TypeError:  # Can be raised by signature() or Signature.bind().
        return False
    except ValueError:
        # Can be raised for classes, if the relevant info is in `__init__`.
        if not isinstance(type_, type):
            raise
    else:
        if sig.parameters[argname].annotation is str:
            return True
    # FIXME
    # if isinstance(type_, type):
    #     # signature() first checks __new__, if it is present.
    #     return _is_constructible_from_str(type_.__init__(object(), type_))
    return False
is_list_like
is_list_like(type_: type) -> bool

Returns true if the value is a list or list like object

Source code in fgpyo/util/types.py
def is_list_like(type_: type) -> bool:
    """Returns true if the value is a list or list like object"""
    return typing.get_origin(type_) in [list, collections.abc.Iterable, collections.abc.Sequence]
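For example, both pre-PEP 585 generic aliases and abstract sequence types qualify:

>>> from typing import List, Sequence
>>> from fgpyo.util.types import is_list_like
>>> is_list_like(List[int])
True
>>> is_list_like(Sequence[str])
True
>>> is_list_like(int)
False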
make_enum_parser
make_enum_parser(enum: Type[EnumType]) -> partial

Makes a parser function for enum classes

Source code in fgpyo/util/types.py
def make_enum_parser(enum: Type[EnumType]) -> partial:
    """Makes a parser function for enum classes"""
    return partial(_make_enum_parser_worker, enum)
make_literal_parser
make_literal_parser(literal: Type[LiteralType], parsers: Iterable[Callable[[str], LiteralType]]) -> partial

Generates a parser function for a literal type object, given a set of parsers for the possible values of that literal type

Source code in fgpyo/util/types.py
def make_literal_parser(
    literal: Type[LiteralType], parsers: Iterable[Callable[[str], LiteralType]]
) -> partial:
    """Generates a parser function for a literal type object and a set of parsers for the possible
    parsers to that literal type object
    """
    return partial(_make_literal_parser_worker, literal, parsers)
make_union_parser
make_union_parser(union: Type[UnionType], parsers: Iterable[Callable[[str], UnionType]]) -> partial

Generates a parser function for a union type object, given a set of parsers for the member types of that union

Source code in fgpyo/util/types.py
def make_union_parser(
    union: Type[UnionType], parsers: Iterable[Callable[[str], UnionType]]
) -> partial:
    """Generates a parser function for a union type object and set of parsers for the possible
    parsers to that union type object
    """
    return partial(_make_union_parser_worker, union, parsers)
none_parser
none_parser(value: str) -> Literal[None]

Returns None if the value is the empty string, otherwise raises an error

Source code in fgpyo/util/types.py
def none_parser(value: str) -> Literal[None]:
    """Returns None if the value is 'None', else raises an error"""
    if value == "":
        return None
    raise ValueError(f"NoneType not a valid type for {value}")
parse_bool
parse_bool(string: str) -> bool

Parses strings into bools accounting for the many different text representations of bools that can be used

Source code in fgpyo/util/types.py
def parse_bool(string: str) -> bool:
    """Parses strings into bools accounting for the many different text representations of bools
    that can be used
    """
    if string.lower() in ["t", "true", "1"]:
        return True
    elif string.lower() in ["f", "false", "0"]:
        return False
    else:
        raise ValueError("{} is not a valid boolean string".format(string))
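For example:

>>> from fgpyo.util.types import parse_bool
>>> parse_bool("True")
True
>>> parse_bool("0")
False
>>> parse_bool("yes")
Traceback (most recent call last):
    ...
ValueError: yes is not a valid boolean string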

vcf

Classes for generating VCF files and records for testing

This module contains utility classes for the generation of VCF files and variant records, for use in testing.

The module contains the following public classes:

  • VariantBuilder() -- A builder class that allows the accumulation of variant records and access as a list and writing to file.
Examples

Typically, we have pysam.VariantRecord records obtained from reading from a VCF file. The VariantBuilder() class builds such records.

Variants are added with the add() method, which returns a pysam.VariantRecord.

>>> import pysam
>>> from fgpyo.vcf.builder import VariantBuilder
>>> builder: VariantBuilder = VariantBuilder()
>>> new_record_1: pysam.VariantRecord = builder.add()  # uses the defaults
>>> new_record_2: pysam.VariantRecord = builder.add(
...     contig="chr2", pos=1001, id="rs1234", ref="C", alts=["T"],
...     qual=40, filter=["PASS"]
... )

VariantBuilder can create sites-only, single-sample, or multi-sample VCF files. If not producing a sites-only VCF file, VariantBuilder must be created by passing a list of sample IDs:

>>> builder: VariantBuilder = VariantBuilder(sample_ids=["sample1", "sample2"])
>>> new_record_1: pysam.VariantRecord = builder.add()  # uses the defaults
>>> new_record_2: pysam.VariantRecord = builder.add(
...     samples={"sample1": {"GT": "0|1"}, "sample2": {"GT": "0|0"}}
... )

The variants stored in the builder can be retrieved as a coordinate-sorted VCF file via the to_path() method:

>>> from pathlib import Path
>>> path_to_vcf: Path = builder.to_path()  

The variants may also be retrieved in the order they were added via the to_unsorted_list() method, and in coordinate-sorted order via the to_sorted_list() method.
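For example (a short sketch assuming the default sequence dictionary, in which chr1 precedes chr2):

>>> builder = VariantBuilder()
>>> first = builder.add(contig="chr2", pos=500)
>>> second = builder.add(contig="chr1", pos=1000)
>>> [v.contig for v in builder.to_unsorted_list()]
['chr2', 'chr1']
>>> [v.contig for v in builder.to_sorted_list()]
['chr1', 'chr2']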

Functions

reader
reader(path: VcfPath) -> Generator[VariantFile, None, None]

Opens the given path for VCF reading

Parameters:

Name Type Description Default
path VcfPath

the path to a VCF, or an open file handle

required
Source code in fgpyo/vcf/__init__.py
@contextmanager
def reader(path: VcfPath) -> Generator[VcfReader, None, None]:
    """Opens the given path for VCF reading

    Args:
        path: the path to a VCF, or an open file handle
    """
    if isinstance(path, (str, Path, TextIO)):
        with fgpyo.io.suppress_stderr():
            # to avoid spamming log about index older than vcf, redirect stderr to /dev/null: only
            # when first opening the file
            _reader = VariantFile(path, mode="r")  # type: ignore[arg-type]
        # now stderr is back, so any later stderr messages will go through
        yield _reader
        _reader.close()
    else:
        raise TypeError(f"Cannot open '{type(path)}' for VCF reading.")
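A typical use, iterating over the records of an existing file (a sketch; "example.vcf.gz" is a hypothetical path):

>>> from fgpyo.vcf import reader
>>> with reader("example.vcf.gz") as vcf_in:
...     for record in vcf_in:
...         print(record.contig, record.pos)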
writer
writer(path: VcfPath, header: VariantHeader) -> Generator[VariantFile, None, None]

Opens the given path for VCF writing.

Parameters:

Name Type Description Default
path VcfPath

the path to a VCF, or an open filehandle

required
header VariantHeader

the source for the output VCF header. If you are modifying a VCF file that you are reading from, you can pass reader.header

required
Source code in fgpyo/vcf/__init__.py
@contextmanager
def writer(path: VcfPath, header: VariantHeader) -> Generator[VcfWriter, None, None]:
    """Opens the given path for VCF writing.

    Args:
        path: the path to a VCF, or an open filehandle
        header: the source for the output VCF header. If you are modifying a VCF file that you are
                reading from, you can pass reader.header
    """
    # Convert Path to str such that pysam will autodetect to write as a gzipped file if provided
    # with a .vcf.gz suffix.
    if isinstance(path, Path):
        path = str(path)
    _writer = VariantFile(path, header=header, mode="w")
    yield _writer
    _writer.close()
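For example, copying records from one VCF to another while reusing the input header (a sketch with hypothetical paths):

>>> from fgpyo.vcf import reader, writer
>>> with reader("in.vcf.gz") as vcf_in:
...     with writer("out.vcf.gz", header=vcf_in.header) as vcf_out:
...         for record in vcf_in:
...             vcf_out.write(record)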

Modules

builder
Classes for generating VCF files and records for testing
Classes
VariantBuilder

Builder for constructing one or more variant records (pysam.VariantRecord) for a VCF. The VCF can be sites-only, single-sample, or multi-sample.

Provides the ability to manufacture variants from minimal arguments, while generating any remaining attributes to ensure a valid variant.

A builder is constructed with a handful of defaults including the sample name and sequence dictionary. If the VCF will not be sites-only, the list of sample IDs ("sample_ids") must be provided to the VariantBuilder constructor.

Variants are then added using the add() method. Once accumulated, the variants can be accessed in the order in which they were created through the to_unsorted_list() function, or in a list sorted by coordinate order via to_sorted_list(). Lastly, the records can be written to a temporary file using to_path().

Attributes:

Name Type Description
sample_ids List[str]

the sample name(s)

sd Dict[str, Dict[str, Any]]

sequence dictionary, implemented as python dict from contig name to dictionary with contig properties. At a minimum, each contig dict in sd must contain "ID" (the same as contig_name) and "length", the contig length. Other values will be added to the VCF header line for that contig.

seq_idx_lookup Dict[str, int]

dictionary mapping contig name to index of contig in sd

records List[VariantRecord]

the list of variant records

header VariantHeader

the pysam header

Source code in fgpyo/vcf/builder.py
class VariantBuilder:
    """
    Builder for constructing one or more variant records (pysam.VariantRecord) for a VCF. The VCF
    can be sites-only, single-sample, or multi-sample.

    Provides the ability to manufacture variants from minimal arguments, while generating
    any remaining attributes to ensure a valid variant.

    A builder is constructed with a handful of defaults including the sample name and sequence
    dictionary. If the VCF will not be sites-only, the list of sample IDs ("sample_ids") must be
    provided to the VariantBuilder constructor.

    Variants are then added using the [`add()`][fgpyo.vcf.builder.VariantBuilder.add]
    method.
    Once accumulated, the variants can be accessed in the order in which they were created through
    the [`to_unsorted_list()`][fgpyo.vcf.builder.VariantBuilder.to_unsorted_list]
    function, or in a list sorted by coordinate order via
    [`to_sorted_list()`][fgpyo.vcf.builder.VariantBuilder.to_sorted_list]. Lastly, the
    records can be written to a temporary file using
    [`to_path()`][fgpyo.vcf.builder.VariantBuilder.to_path].

    Attributes:
        sample_ids: the sample name(s)
        sd: sequence dictionary, implemented as python dict from contig name to dictionary with
            contig properties. At a minimum, each contig dict in sd must contain "ID" (the same as
            contig_name) and "length", the contig length. Other values will be added to the VCF
            header line for that contig.
        seq_idx_lookup: dictionary mapping contig name to index of contig in sd
        records: the list of variant records
        header: the pysam header
    """

    sample_ids: List[str]
    sd: Dict[str, Dict[str, Any]]
    seq_idx_lookup: Dict[str, int]
    records: List[VariantRecord]
    header: VariantHeader

    def __init__(
        self,
        sample_ids: Optional[Iterable[str]] = None,
        sd: Optional[Dict[str, Dict[str, Any]]] = None,
    ) -> None:
        """Initializes a new VariantBuilder for generating variants and VCF files.

        Args:
            sample_ids: the name of the sample(s)
            sd: optional sequence dictionary
        """
        self.sample_ids: List[str] = list(sample_ids) if sample_ids is not None else []
        self.sd: Dict[str, Dict[str, Any]] = sd if sd is not None else VariantBuilder.default_sd()
        self.seq_idx_lookup: Dict[str, int] = {name: i for i, name in enumerate(self.sd.keys())}
        self.records: List[VariantRecord] = []
        self.header = VariantHeader()
        for line in VariantBuilder._build_header_string(sd=self.sd):
            self.header.add_line(line)
        if sample_ids is not None:
            self.header.add_samples(sample_ids)

    @classmethod
    def default_sd(cls) -> Dict[str, Dict[str, Any]]:
        """Generates the sequence dictionary that is used by default by VariantBuilder.
        Re-uses the dictionary from SamBuilder for consistency.

        Returns:
            A new copy of the sequence dictionary as a map of contig name to dictionary, one per
            contig.
        """
        sd: Dict[str, Dict[str, Any]] = {}
        for sequence in SamBuilder.default_sd():
            contig = sequence["SN"]
            sd[contig] = {"ID": contig, "length": sequence["LN"]}
        return sd

    @classmethod
    def _build_header_string(cls, sd: Optional[Dict[str, Dict[str, Any]]] = None) -> Iterator[str]:
        """Builds the VCF header with the given sample name(s) and sequence dictionary.

        Args:
            sd: the sequence dictionary mapping the contig name to the key-value pairs for the
                given contig.  Must include "ID" and "length" for each contig.  If no sequence
                dictionary is given, will use the default dictionary.
        """
        if sd is None:
            sd = VariantBuilder.default_sd()
        # add mandatory VCF format
        yield "##fileformat=VCFv4.2"
        # add GT
        yield '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">'
        # add additional common INFO lines
        yield '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">'
        yield (
            '##INFO=<ID=AR,Number=A,Type=Float,Description="Allele Ratio - ratio of AD for allele'
            ' vs. AD for modal allele.">'
        )
        yield '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">'
        # add additional common FORMAT lines
        yield (
            '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt'
            ' alleles in the order listed">'
        )
        yield '##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">'
        yield '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total Depth">'

        for d in sd.values():
            if "ID" not in d or "length" not in d:
                raise ValueError(
                    "Sequence dictionary must include 'ID' and 'length' for each contig."
                )
            contig_id = d["ID"]
            contig_length = d["length"]
            contig_header = f"##contig=<ID={contig_id},length={contig_length}"
            for key, value in d.items():
                if key == "ID" or key == "length":
                    continue
                contig_header += f",{key}={value}"
            contig_header += ">"
            yield contig_header

    @property
    def num_samples(self) -> int:
        return len(self.sample_ids)

    def add(
        self,
        contig: Optional[str] = None,
        pos: int = 1000,
        end: Optional[int] = None,
        id: str = ".",
        ref: str = "A",
        alts: Union[None, str, Iterable[str]] = (".",),
        qual: int = 60,
        filter: Union[None, str, Iterable[str]] = None,
        info: Optional[Dict[str, Any]] = None,
        samples: Optional[Dict[str, Dict[str, Any]]] = None,
    ) -> VariantRecord:
        """Generates a new variant and adds it to the internal collection.

        Notes:
        * Very little validation is done with respect to INFO and FORMAT keys being defined in the
        header.
        * VCFs are 1-based, but pysam is (mostly) 0-based. We define the function in terms of the
        VCF property "pos", which is 1-based. pysam will also report "pos" as 1-based, so that is
        the property that should be accessed when using the records produced by this function (not
        "start").

        Args:
            contig: the chromosome name. If None, will use the first contig in the sequence
                    dictionary.
            pos: the 1-based position of the variant
            end: an optional 1-based inclusive END position; if not specified a value will be looked
                 for in info["END"], or calculated from the length of the reference allele
            id: the variant id
            ref: the reference allele
            alts: the list of alternate alleles, None if no alternates. If a single string is
                  passed, that will be used as the only alt.
            qual: the variant quality
            filter: the list of filters, None if no filters (ex. PASS). If a single string is
                    passed, that will be used as the only filter.
            info: the dictionary of INFO key-value pairs
            samples: the dictionary from sample name to FORMAT key-value pairs.
                     if a sample property is supplied for any sample but omitted in some, it will
                     be set to missing (".") for samples that don't have that property explicitly
                     assigned. If a sample in the VCF is omitted, all its properties will be set to
                     missing.
        """
        if contig is None:
            contig = next(iter(self.sd.keys()))

        if contig not in self.sd:
            raise ValueError(f"Chromosome `{contig}` not in the sequence dictionary.")
        # because there are a lot of slightly different objects related to samples or called
        # "samples" in this function, we alias samples to sample_formats
        # we still want to keep the API labeled "samples" because that keeps the naming scheme the
        # same as the pysam API
        sample_formats = samples
        if sample_formats is not None:
            unknown_samples = set(sample_formats.keys()).difference(self.sample_ids)
            if len(unknown_samples) > 0:
                raise ValueError("Unknown sample(s) given: " + ", ".join(unknown_samples))

        if isinstance(alts, str):
            alts = (alts,)
        alleles = (ref,) if alts is None else (ref, *alts)
        if isinstance(filter, str):
            filter = (filter,)

        # pysam expects a list of format dicts provided in the same order as the samples in the
        # header (self.sample_ids). (This is despite the fact that it will internally represent the
        # values as a map from sample ID to format values, as we do in this function.)
        # Convert to that form and rename to record_samples; to a) disambiguate from the input
        # values, and b) prevent mypy from complaining about the type changing from dict to list.
        if self.num_samples == 0:
            # this is a sites-only VCF
            record_samples = None
        elif sample_formats is None or len(sample_formats) == 0:
            # not a sites-only VCF, but no FORMAT values were passed. set FORMAT to missing (with
            # no fields)
            record_samples = None
        else:
            # convert to list form that pysam expects, in order pysam expects
            # note: the copy {**format_dict} below is present because pysam actually alters the
            # input values, which would be an unintended side-effect (in fact without this, tests
            # fail because the expected input values are changed)
            record_samples = [
                {**sample_formats.get(sample_id, {})} for sample_id in self.sample_ids
            ]

        variant = self.header.new_record(
            contig=contig,
            start=pos - 1,  # start is 0-based
            stop=self._compute_and_check_end(pos, ref, end, info),
            id=id,
            alleles=alleles,
            qual=qual,
            filter=filter,
            info=info,
            samples=record_samples,
        )

        self.records.append(variant)
        return variant

    def _compute_and_check_end(
        self, pos: int, ref: str, end: Optional[int], info: Optional[dict[str, Any]]
    ) -> int:
        """
        Derives the END/stop position for a new record based on the optionally provided `end`
        parameter, the presence/absence of END in the info dictionary and/or the length of the
        reference allele.

        Also checks that any given or calculated end position is at least greater than or equal
        to the record's position.

        Args:
            pos: the 1-based position of the record
            ref: the reference allele of the record
            end: the provided 1-based end position if one was given
            info: the info dictionary if one was given
        """
        if end is not None and info is not None and "END" in info:
            raise ValueError(f"Two end positions given; end={end} and info.END={info['END']}")
        elif end is None:
            if info is not None and "END" in info:
                end = int(info["END"])
            else:
                end = pos + len(ref) - 1

        if end < pos:
            raise ValueError(f"Invalid end position, {end}, given for variant at pos {pos}.")

        return end

    def to_path(self, path: Optional[Path] = None) -> Path:
        """
        Returns a path to a VCF for variants added to this builder.

        If the path given ends in ".gz" then the generated file will be bgzipped and
        a tabix index generated for the file with the suffix ".gz.tbi".

        Args:
            path: optional path to the VCF
        """
        # update the path
        path = self._to_vcf_path(path)

        # Create a writer and write to it
        with PysamWriter(path, header=self.header) as writer:
            for variant in self.to_sorted_list():
                writer.write(variant)

        if str(path.suffix) == ".gz":
            pysam.tabix_index(str(path), preset="vcf", force=True)

        return path

    @staticmethod
    def _to_vcf_path(path: Optional[Path]) -> Path:
        """Gets the path to a VCF file.  If path is a directory, a temporary VCF will be created in
        that directory. If path is `None`, then a temporary VCF will be created.  Otherwise, the
        given path is simply returned.

        Args:
            path: optionally the path to the VCF, or a directory to create a temporary VCF.
        """
        if path is None:
            with NamedTemporaryFile(suffix=".vcf.gz", delete=False) as fp:
                path = Path(fp.name)
            assert path.is_file()
        return path

    def to_unsorted_list(self) -> List[VariantRecord]:
        """Returns the accumulated records in the order they were created."""
        return list(self.records)

    def to_sorted_list(self) -> List[VariantRecord]:
        """Returns the accumulated records in coordinate order."""
        return sorted(self.records, key=self._sort_key)

    def _sort_key(self, variant: VariantRecord) -> Tuple[int, int, int]:
        return self.seq_idx_lookup[variant.contig], variant.start, variant.stop

    def add_header_line(self, line: str) -> None:
        """Adds a header line to the header"""
        self.header.add_line(line)

    def add_info_header(
        self,
        name: str,
        field_type: VcfFieldType,
        number: Union[int, VcfFieldNumber] = 1,
        description: Optional[str] = None,
        source: Optional[str] = None,
        version: Optional[str] = None,
    ) -> None:
        """Add an INFO header field to the VCF header.

        Args:
            name: the name of the field
            field_type: the field_type of the field
            number: the number of the field
            description: the description of the field
            source: the source of the field
            version: the version of the field
        """
        if field_type == VcfFieldType.FLAG:
            num = "0"  # FLAGs always have number = 0
        elif isinstance(number, VcfFieldNumber):
            num = number.value
        else:
            num = str(number)

        header_line = f"##INFO=<ID={name},Number={num},Type={field_type.value}"
        if description is not None:
            header_line += f",Description={description}"
        if source is not None:
            header_line += f",Source={source}"
        if version is not None:
            header_line += f",Version={version}"
        header_line += ">"
        self.add_header_line(header_line)

    def add_format_header(
        self,
        name: str,
        field_type: VcfFieldType,
        number: Union[int, VcfFieldNumber] = VcfFieldNumber.NUM_GENOTYPES,
        description: Optional[str] = None,
    ) -> None:
        """
        Add a FORMAT header field to the VCF header.

        Args:
            name: the name of the field
            field_type: the field_type of the field
            number: the number of the field
            description: the description of the field
        """
        if isinstance(number, VcfFieldNumber):
            num = number.value
        else:
            num = str(number)

        header_line = f"##FORMAT=<ID={name},Number={num},Type={field_type.value}"
        if description is not None:
            header_line += f",Description={description}"
        header_line += ">"
        self.add_header_line(header_line)

    def add_filter_header(
        self,
        name: str,
        description: Optional[str] = None,
    ) -> None:
        """
        Add a FILTER header field to the VCF header.

        Args:
            name: the name of the field
            description: the description of the field
        """
        header_line = f"##FILTER=<ID={name}"
        if description is not None:
            header_line += f",Description={description}"
        header_line += ">"
        self.add_header_line(header_line)
Functions
__init__
__init__(sample_ids: Optional[Iterable[str]] = None, sd: Optional[Dict[str, Dict[str, Any]]] = None) -> None

Initializes a new VariantBuilder for generating variants and VCF files.

Parameters:

Name Type Description Default
sample_ids Optional[Iterable[str]]

the name of the sample(s)

None
sd Optional[Dict[str, Dict[str, Any]]]

optional sequence dictionary

None
Source code in fgpyo/vcf/builder.py
def __init__(
    self,
    sample_ids: Optional[Iterable[str]] = None,
    sd: Optional[Dict[str, Dict[str, Any]]] = None,
) -> None:
    """Initializes a new VariantBuilder for generating variants and VCF files.

    Args:
        sample_ids: the name of the sample(s)
        sd: optional sequence dictionary
    """
    self.sample_ids: List[str] = list(sample_ids) if sample_ids is not None else []
    self.sd: Dict[str, Dict[str, Any]] = sd if sd is not None else VariantBuilder.default_sd()
    self.seq_idx_lookup: Dict[str, int] = {name: i for i, name in enumerate(self.sd.keys())}
    self.records: List[VariantRecord] = []
    self.header = VariantHeader()
    for line in VariantBuilder._build_header_string(sd=self.sd):
        self.header.add_line(line)
    if sample_ids is not None:
        self.header.add_samples(sample_ids)
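For example, constructing a builder over a custom two-contig sequence dictionary (a sketch; each contig entry must provide at least "ID" and "length"):

>>> sd = {
...     "chrA": {"ID": "chrA", "length": 10000},
...     "chrB": {"ID": "chrB", "length": 5000},
... }
>>> builder = VariantBuilder(sd=sd)
>>> builder.add().contig  # defaults to the first contig in the dictionary
'chrA'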
add
add(contig: Optional[str] = None, pos: int = 1000, end: Optional[int] = None, id: str = '.', ref: str = 'A', alts: Union[None, str, Iterable[str]] = ('.',), qual: int = 60, filter: Union[None, str, Iterable[str]] = None, info: Optional[Dict[str, Any]] = None, samples: Optional[Dict[str, Dict[str, Any]]] = None) -> VariantRecord

Generates a new variant and adds it to the internal collection.

Notes:

  • Very little validation is done with respect to INFO and FORMAT keys being defined in the header.
  • VCFs are 1-based, but pysam is (mostly) 0-based. We define the function in terms of the VCF property "pos", which is 1-based. pysam will also report "pos" as 1-based, so that is the property that should be accessed when using the records produced by this function (not "start").

Parameters:

Name Type Description Default
contig Optional[str]

the chromosome name. If None, will use the first contig in the sequence dictionary.

None
pos int

the 1-based position of the variant

1000
end Optional[int]

an optional 1-based inclusive END position; if not specified a value will be looked for in info["END"], or calculated from the length of the reference allele

None
id str

the variant id

'.'
ref str

the reference allele

'A'
alts Union[None, str, Iterable[str]]

the list of alternate alleles, None if no alternates. If a single string is passed, that will be used as the only alt.

('.',)
qual int

the variant quality

60
filter Union[None, str, Iterable[str]]

the list of filters, None if no filters (ex. PASS). If a single string is passed, that will be used as the only filter.

None
info Optional[Dict[str, Any]]

the dictionary of INFO key-value pairs

None
samples Optional[Dict[str, Dict[str, Any]]]

the dictionary from sample name to FORMAT key-value pairs. If a sample property is supplied for any sample but omitted in some, it will be set to missing (".") for samples that don't have that property explicitly assigned. If a sample in the VCF is omitted, all its properties will be set to missing.

None
Source code in fgpyo/vcf/builder.py
def add(
    self,
    contig: Optional[str] = None,
    pos: int = 1000,
    end: Optional[int] = None,
    id: str = ".",
    ref: str = "A",
    alts: Union[None, str, Iterable[str]] = (".",),
    qual: int = 60,
    filter: Union[None, str, Iterable[str]] = None,
    info: Optional[Dict[str, Any]] = None,
    samples: Optional[Dict[str, Dict[str, Any]]] = None,
) -> VariantRecord:
    """Generates a new variant and adds it to the internal collection.

    Notes:
    * Very little validation is done with respect to INFO and FORMAT keys being defined in the
    header.
    * VCFs are 1-based, but pysam is (mostly) 0-based. We define the function in terms of the
    VCF property "pos", which is 1-based. pysam will also report "pos" as 1-based, so that is
    the property that should be accessed when using the records produced by this function (not
    "start").

    Args:
        contig: the chromosome name. If None, will use the first contig in the sequence
                dictionary.
        pos: the 1-based position of the variant
        end: an optional 1-based inclusive END position; if not specified a value will be looked
             for in info["END"], or calculated from the length of the reference allele
        id: the variant id
        ref: the reference allele
        alts: the list of alternate alleles, None if no alternates. If a single string is
              passed, that will be used as the only alt.
        qual: the variant quality
        filter: the list of filters, None if no filters (ex. PASS). If a single string is
                passed, that will be used as the only filter.
        info: the dictionary of INFO key-value pairs
        samples: the dictionary from sample name to FORMAT key-value pairs.
                 if a sample property is supplied for any sample but omitted in some, it will
                 be set to missing (".") for samples that don't have that property explicitly
                 assigned. If a sample in the VCF is omitted, all its properties will be set to
                 missing.
    """
    if contig is None:
        contig = next(iter(self.sd.keys()))

    if contig not in self.sd:
        raise ValueError(f"Chromosome `{contig}` not in the sequence dictionary.")
    # because there are a lot of slightly different objects related to samples or called
    # "samples" in this function, we alias samples to sample_formats
    # we still want to keep the API labeled "samples" because that keeps the naming scheme the
    # same as the pysam API
    sample_formats = samples
    if sample_formats is not None:
        unknown_samples = set(sample_formats.keys()).difference(self.sample_ids)
        if len(unknown_samples) > 0:
            raise ValueError("Unknown sample(s) given: " + ", ".join(unknown_samples))

    if isinstance(alts, str):
        alts = (alts,)
    alleles = (ref,) if alts is None else (ref, *alts)
    if isinstance(filter, str):
        filter = (filter,)

    # pysam expects a list of format dicts provided in the same order as the samples in the
    # header (self.sample_ids). (This is despite the fact that it will internally represent the
    # values as a map from sample ID to format values, as we do in this function.)
    # Convert to that form and rename to record_samples; to a) disambiguate from the input
    # values, and b) prevent mypy from complaining about the type changing from dict to list.
    if self.num_samples == 0:
        # this is a sites-only VCF
        record_samples = None
    elif sample_formats is None or len(sample_formats) == 0:
        # not a sites-only VCF, but no FORMAT values were passed. set FORMAT to missing (with
        # no fields)
        record_samples = None
    else:
        # convert to list form that pysam expects, in order pysam expects
        # note: the copy {**format_dict} below is present because pysam actually alters the
        # input values, which would be an unintended side-effect (in fact without this, tests
        # fail because the expected input values are changed)
        record_samples = [
            {**sample_formats.get(sample_id, {})} for sample_id in self.sample_ids
        ]

    variant = self.header.new_record(
        contig=contig,
        start=pos - 1,  # start is 0-based
        stop=self._compute_and_check_end(pos, ref, end, info),
        id=id,
        alleles=alleles,
        qual=qual,
        filter=filter,
        info=info,
        samples=record_samples,
    )

    self.records.append(variant)
    return variant
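As the source shows, a bare string passed for alts (or filter) is promoted to a single-element tuple; a minimal sketch:

>>> builder = VariantBuilder()
>>> rec = builder.add(contig="chr1", pos=2000, ref="A", alts="T")
>>> rec.alleles
('A', 'T')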
add_filter_header
add_filter_header(name: str, description: Optional[str] = None) -> None

Add a FILTER header field to the VCF header.

Parameters:

Name Type Description Default
name str

the name of the field

required
description Optional[str]

the description of the field

None
Source code in fgpyo/vcf/builder.py
def add_filter_header(
    self,
    name: str,
    description: Optional[str] = None,
) -> None:
    """
    Add a FILTER header field to the VCF header.

    Args:
        name: the name of the field
        description: the description of the field
    """
    header_line = f"##FILTER=<ID={name}"
    if description is not None:
        header_line += f",Description={description}"
    header_line += ">"
    self.add_header_line(header_line)
add_format_header
add_format_header(name: str, field_type: VcfFieldType, number: Union[int, VcfFieldNumber] = NUM_GENOTYPES, description: Optional[str] = None) -> None

Add a FORMAT header field to the VCF header.

Parameters:

Name Type Description Default
name str

the name of the field

required
field_type VcfFieldType

the field_type of the field

required
number Union[int, VcfFieldNumber]

the number of the field

NUM_GENOTYPES
description Optional[str]

the description of the field

None
Source code in fgpyo/vcf/builder.py
def add_format_header(
    self,
    name: str,
    field_type: VcfFieldType,
    number: Union[int, VcfFieldNumber] = VcfFieldNumber.NUM_GENOTYPES,
    description: Optional[str] = None,
) -> None:
    """
    Add a FORMAT header field to the VCF header.

    Args:
        name: the name of the field
        field_type: the field_type of the field
        number: the number of the field
        description: the description of the field
    """
    if isinstance(number, VcfFieldNumber):
        num = number.value
    else:
        num = str(number)

    header_line = f"##FORMAT=<ID={name},Number={num},Type={field_type.value}"
    if description is not None:
        header_line += f",Description={description}"
    header_line += ">"
    self.add_header_line(header_line)
add_header_line
add_header_line(line: str) -> None

Adds a header line to the header

Source code in fgpyo/vcf/builder.py
def add_header_line(self, line: str) -> None:
    """Adds a header line to the header"""
    self.header.add_line(line)
add_info_header
add_info_header(name: str, field_type: VcfFieldType, number: Union[int, VcfFieldNumber] = 1, description: Optional[str] = None, source: Optional[str] = None, version: Optional[str] = None) -> None

Add an INFO header field to the VCF header.

Parameters:

Name Type Description Default
name str

the name of the field

required
field_type VcfFieldType

the field_type of the field

required
number Union[int, VcfFieldNumber]

the number of the field

1
description Optional[str]

the description of the field

None
source Optional[str]

the source of the field

None
version Optional[str]

the version of the field

None
Source code in fgpyo/vcf/builder.py
def add_info_header(
    self,
    name: str,
    field_type: VcfFieldType,
    number: Union[int, VcfFieldNumber] = 1,
    description: Optional[str] = None,
    source: Optional[str] = None,
    version: Optional[str] = None,
) -> None:
    """Add an INFO header field to the VCF header.

    Args:
        name: the name of the field
        field_type: the field_type of the field
        number: the number of the field
        description: the description of the field
        source: the source of the field
        version: the version of the field
    """
    if field_type == VcfFieldType.FLAG:
        num = "0"  # FLAGs always have number = 0
    elif isinstance(number, VcfFieldNumber):
        num = number.value
    else:
        num = str(number)

    header_line = f"##INFO=<ID={name},Number={num},Type={field_type.value}"
    if description is not None:
        header_line += f",Description={description}"
    if source is not None:
        header_line += f",Source={source}"
    if version is not None:
        header_line += f",Version={version}"
    header_line += ">"
    self.add_header_line(header_line)
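For example, registering a custom INFO field and then populating it on a record (a sketch; the description is kept comma-free because this helper does not quote it):

>>> from fgpyo.vcf.builder import VariantBuilder, VcfFieldType
>>> builder = VariantBuilder()
>>> builder.add_info_header("SVTYPE", field_type=VcfFieldType.STRING, description="Variant type")
>>> rec = builder.add(contig="chr1", pos=100, info={"SVTYPE": "DEL"})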
default_sd classmethod
default_sd() -> Dict[str, Dict[str, Any]]

Generates the sequence dictionary that is used by default by VariantBuilder. Re-uses the dictionary from SamBuilder for consistency.

Returns:

Type Description
Dict[str, Dict[str, Any]]

A new copy of the sequence dictionary as a map of contig name to dictionary, one per contig.

Source code in fgpyo/vcf/builder.py
@classmethod
def default_sd(cls) -> Dict[str, Dict[str, Any]]:
    """Generates the sequence dictionary that is used by default by VariantBuilder.
    Re-uses the dictionary from SamBuilder for consistency.

    Returns:
        A new copy of the sequence dictionary as a map of contig name to dictionary, one per
        contig.
    """
    sd: Dict[str, Dict[str, Any]] = {}
    for sequence in SamBuilder.default_sd():
        contig = sequence["SN"]
        sd[contig] = {"ID": contig, "length": sequence["LN"]}
    return sd
to_path
to_path(path: Optional[Path] = None) -> Path

Returns a path to a VCF for variants added to this builder.

If the path given ends in ".gz" then the generated file will be bgzipped and a tabix index generated for the file with the suffix ".gz.tbi".

Parameters:

Name Type Description Default
path Optional[Path]

optional path to the VCF

None
Source code in fgpyo/vcf/builder.py
def to_path(self, path: Optional[Path] = None) -> Path:
    """
    Returns a path to a VCF for variants added to this builder.

    If the path given ends in ".gz" then the generated file will be bgzipped and
    a tabix index generated for the file with the suffix ".gz.tbi".

    Args:
        path: optional path to the VCF
    """
    # update the path
    path = self._to_vcf_path(path)

    # Create a writer and write to it
    with PysamWriter(path, header=self.header) as writer:
        for variant in self.to_sorted_list():
            writer.write(variant)

    if str(path.suffix) == ".gz":
        pysam.tabix_index(str(path), preset="vcf", force=True)

    return path
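For instance (a sketch; the output path is hypothetical and its parent directory must be writable):

>>> from pathlib import Path
>>> builder = VariantBuilder()
>>> _ = builder.add()
>>> out = builder.to_path(Path("/tmp/example.vcf.gz"))  # index written to /tmp/example.vcf.gz.tbi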
to_sorted_list
to_sorted_list() -> List[VariantRecord]

Returns the accumulated records in coordinate order.

Source code in fgpyo/vcf/builder.py
def to_sorted_list(self) -> List[VariantRecord]:
    """Returns the accumulated records in coordinate order."""
    return sorted(self.records, key=self._sort_key)
to_unsorted_list
to_unsorted_list() -> List[VariantRecord]

Returns the accumulated records in the order they were created.

Source code in fgpyo/vcf/builder.py
def to_unsorted_list(self) -> List[VariantRecord]:
    """Returns the accumulated records in the order they were created."""
    return list(self.records)
VcfFieldNumber

Bases: Enum

Special codes for VCF field numbers

Source code in fgpyo/vcf/builder.py
class VcfFieldNumber(Enum):
    """Special codes for VCF field numbers"""

    NUM_ALT_ALLELES = "A"
    NUM_ALLELES = "R"
    NUM_GENOTYPES = "G"
    UNKNOWN = "."
VcfFieldType

Bases: Enum

Codes for VCF field types

Source code in fgpyo/vcf/builder.py
class VcfFieldType(Enum):
    """Codes for VCF field types"""

    INTEGER = "Integer"
    FLOAT = "Float"
    FLAG = "Flag"
    CHARACTER = "Character"
    STRING = "String"
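These enums supply the Number and Type values for the header helpers above; for example, NUM_ALLELES renders as Number=R (a sketch with a hypothetical FORMAT field name):

>>> from fgpyo.vcf.builder import VariantBuilder, VcfFieldNumber, VcfFieldType
>>> builder = VariantBuilder(sample_ids=["sample1"])
>>> builder.add_format_header(
...     "ADF",  # hypothetical forward-strand allelic depths
...     field_type=VcfFieldType.INTEGER,
...     number=VcfFieldNumber.NUM_ALLELES,
...     description="Forward-strand allelic depths",
... )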
Functions