Skip to content

sam

Utility Classes and Methods for SAM/BAM

This module contains utility classes for working with SAM/BAM files and the data contained within them. This includes i) utilities for opening SAM/BAM files for reading and writing, ii) functions for manipulating supplementary alignments, iii) classes and functions for maniuplating CIGAR strings, and iv) a class for building sam records and files for testing.

Motivation for Reader and Writer methods

The following are the reasons for choosing to implement methods to open a SAM/BAM file for reading and writing, rather than relying on pysam.AlignmentFile directly:

  1. Provides a centralized place for the implementation of opening a SAM/BAM for reading and writing. This is useful if any additional parameters are added, or changes to standards or defaults are made.
  2. Makes the requirement to provide a header when opening a file for writing more explicit.
  3. Adds support for pathlib.Path.
  4. Remove the reliance on specifying the mode correctly, including specifying the file type (i.e. SAM, BAM, or CRAM), as well as additional options (ex. compression level). This makes the code more explicit and easier to read.
  5. An explicit check is performed to ensure the file type is specified when writing using a file-like object rather than a path to a file.

Examples of Opening a SAM/BAM for Reading or Writing

Opening a SAM/BAM file for reading, auto-recognizing the file-type by the file extension. See SamFileType() for the supported file types.

>>> from fgpyo.sam import reader
>>> with reader("/path/to/sample.sam") as fh:  
...     for record in fh:
...         print(record.query_name)  # do something
>>> with reader("/path/to/sample.bam") as fh:  
...     for record in fh:
...         print(record.query_name)  # do something

Opening a SAM/BAM file for reading, explicitly passing the file type.

>>> from fgpyo.sam import SamFileType
>>> with reader(path="/path/to/sample.ext1", file_type=SamFileType.SAM) as fh:  
...     for record in fh:
...         print(record.query_name)  # do something
>>> with reader(path="/path/to/sample.ext2", file_type=SamFileType.BAM) as fh:  
...     for record in fh:
...         print(record.query_name)  # do something

Opening a SAM/BAM file for reading, using an existing file-like object

>>> with open("/path/to/sample.sam", "rb") as file_object:  
...     with reader(path=file_object, file_type=SamFileType.BAM) as fh:
...         for record in fh:
...             print(record.query_name)  # do something

Opening a SAM/BAM file for writing follows similar to the reader() method, but the SAM file header object is required.

>>> from fgpyo.sam import writer
>>> header: Dict[str, Any] = {
...     "HD": {"VN": "1.5", "SO": "coordinate"},
...     "RG": [{"ID": "1", "SM": "1_AAAAAA", "LB": "lib", "PL": "ILLUMINA", "PU": "xxx.1"}],
...     "SQ":  [
...         {"SN": "chr1", "LN": 249250621},
...         {"SN": "chr2", "LN": 243199373}
...     ]
... }  
>>> with writer(path="/path/to/sample.bam", header=header) as fh:  
...     pass  # do something

Examples of Manipulating Cigars

Creating a Cigar from a pysam.AlignedSegment.

>>> from fgpyo.sam import Cigar
>>> with reader("/path/to/sample.sam") as fh:  
...     record = next(fh)
...     cigar = Cigar.from_cigartuples(record.cigartuples)
...     print(str(cigar))
50M2D5M10S

Creating a Cigar from a str().

>>> cigar = Cigar.from_cigarstring("50M2D5M10S")
>>> print(str(cigar))
50M2D5M10S

If the cigar string is invalid, the exception message will show you the problem character(s) in square brackets.

>>> cigar = Cigar.from_cigarstring("10M5U")
Traceback (most recent call last):
    ...
fgpyo.sam.CigarParsingException: Malformed cigar: 10M5[U]

The cigar contains a tuple of CigarElement()s. Each element contains the cigar operator (CigarOp()) and associated operator length. A number of useful methods are part of both classes.

The number of bases aligned on the query (i.e. the number of bases consumed by the cigar from the query):

>>> cigar = Cigar.from_cigarstring("50M2D5M2I10S")
>>> [e.length_on_query for e in cigar.elements]
[50, 0, 5, 2, 10]
>>> [e.length_on_target for e in cigar.elements]
[50, 2, 5, 0, 0]
>>> [e.operator.is_indel for e in cigar.elements]
[False, True, False, True, False]

Any particular element can be accessed directly via .elements with its index (and works with negative indexes and slices):

>>> cigar = Cigar.from_cigarstring("50M2D5M2I10S")
>>> cigar.elements[0].length
50
>>> cigar.elements[1].operator
<CigarOp.D: (2, 'D', False, True)>
>>> cigar.elements[-1].operator
<CigarOp.S: (4, 'S', True, False)>
>>> tuple(x.operator.character for x in cigar.elements[1:3])
('D', 'M')
>>> tuple(x.operator.character for x in cigar.elements[-2:])
('I', 'S')

Examples of parsing the SA tag and individual supplementary alignments

>>> from fgpyo.sam import SupplementaryAlignment
>>> sup = SupplementaryAlignment.parse("chr1,123,+,50S100M,60,0")
>>> sup.reference_name
'chr1'
>>> sup.nm
0
>>> from typing import List
>>> sa_tag = "chr1,123,+,50S100M,60,0;chr2,456,-,75S75M,60,1"
>>> sups: List[SupplementaryAlignment] = SupplementaryAlignment.parse_sa_tag(tag=sa_tag)
>>> len(sups)
2
>>> [str(sup.cigar) for sup in sups]
['50S100M', '75S75M']

Attributes

DefaultProperlyPairedOrientations module-attribute

DefaultProperlyPairedOrientations: set[PairOrientation] = {FR}

The default orientations for properly paired reads.

NO_QUERY_QUALITIES module-attribute

NO_QUERY_QUALITIES: array = qualitystring_to_array(STRING_PLACEHOLDER)

The quality array corresponding to an unavailable query quality string ("*").

NO_REF_INDEX module-attribute

NO_REF_INDEX: int = -1

The reference index to use to indicate no reference in SAM/BAM.

NO_REF_NAME module-attribute

NO_REF_NAME: str = STRING_PLACEHOLDER

The reference name to use to indicate no reference in SAM/BAM.

NO_REF_POS module-attribute

NO_REF_POS: int = -1

The reference position to use to indicate no position in SAM/BAM.

STRING_PLACEHOLDER module-attribute

STRING_PLACEHOLDER: str = '*'

The value to use when a string field's information is unavailable.

SamPath module-attribute

SamPath = Union[IO[Any], Path, str]

The valid base classes for opening a SAM/BAM/CRAM file.

Classes

Cigar

Class representing a cigar string.

Attributes:

Name Type Description
- elements (Tuple[CigarElement, ...]

zero or more cigar elements

Source code in fgpyo/sam/__init__.py
@attr.s(frozen=True, slots=True, auto_attribs=True)
class Cigar:
    """Class representing a cigar string.

    Attributes:
        - elements (Tuple[CigarElement, ...]): zero or more cigar elements
    """

    elements: Tuple[CigarElement, ...] = ()

    @classmethod
    def from_cigartuples(cls, cigartuples: Optional[List[Tuple[int, int]]]) -> "Cigar":
        """Returns a Cigar from a list of tuples returned by pysam.

        Each tuple denotes the operation and length.  See
        [`CigarOp()`][fgpyo.sam.CigarOp] for more information on the
        various operators.  If None is given, returns an empty Cigar.
        """
        if cigartuples is None or cigartuples == []:
            return Cigar()
        try:
            elements = []
            for code, length in cigartuples:
                operator = CigarOp.from_code(code)
                elements.append(CigarElement(length, operator))
            return Cigar(tuple(elements))
        except Exception as ex:
            raise CigarParsingException(f"Malformed cigar tuples: {cigartuples}") from ex

    @classmethod
    def _pretty_cigarstring_exception(cls, cigarstring: str, index: int) -> CigarParsingException:
        """Raises an exception highlighting the malformed character"""
        prefix = cigarstring[:index]
        character = cigarstring[index] if index < len(cigarstring) else ""
        suffix = cigarstring[index + 1 :]
        pretty_cigarstring = f"{prefix}[{character}]{suffix}"
        message = f"Malformed cigar: {pretty_cigarstring}"
        return CigarParsingException(message)

    @classmethod
    def from_cigarstring(cls, cigarstring: str) -> "Cigar":
        """Constructs a Cigar from a string returned by pysam.

        If "*" is given, returns an empty Cigar.
        """
        if cigarstring == "*":
            return Cigar()

        cigarstring_length = len(cigarstring)
        if cigarstring_length == 0:
            raise CigarParsingException("Cigar string was empty")

        elements = []
        i = 0
        while i < cigarstring_length:
            if not cigarstring[i].isdigit():
                raise cls._pretty_cigarstring_exception(cigarstring, i)
            length = int(cigarstring[i])
            i += 1
            while i < cigarstring_length and cigarstring[i].isdigit():
                length = (length * 10) + int(cigarstring[i])
                i += 1
            if i == cigarstring_length:
                raise cls._pretty_cigarstring_exception(cigarstring, i)
            try:
                operator = CigarOp.from_character(cigarstring[i])
                elements.append(CigarElement(length, operator))
            except KeyError as ex:
                # cigar operator was not valid
                raise cls._pretty_cigarstring_exception(cigarstring, i) from ex
            except IndexError as ex:
                # missing cigar operator (i == len(cigarstring))
                raise cls._pretty_cigarstring_exception(cigarstring, i) from ex
            i += 1
        return Cigar(tuple(elements))

    def __str__(self) -> str:
        if self.elements:
            return "".join([str(e) for e in self.elements])
        else:
            return "*"

    def reversed(self) -> "Cigar":
        """Returns a copy of the Cigar with the elements in reverse order."""
        return Cigar(tuple(reversed(self.elements)))

    def length_on_query(self) -> int:
        """Returns the length of the alignment on the query sequence."""
        return sum([elem.length_on_query for elem in self.elements])

    def length_on_target(self) -> int:
        """Returns the length of the alignment on the target sequence."""
        return sum([elem.length_on_target for elem in self.elements])

    def query_alignment_offsets(self) -> Tuple[int, int]:
        """
        Gets the 0-based, end-exclusive positions of the first and last aligned base in the query.

        The resulting range will contain the range of positions in the SEQ string for
        the bases that are aligned.
        If counting from the end of the query is desired, use
        `cigar.reversed().query_alignment_offsets()`

        Returns:
            A tuple (start, stop) containing the start and stop positions
                of the aligned part of the query. These offsets are 0-based and open-ended, with
                respect to the beginning of the query.

        Raises:
            ValueError: If according to the cigar, there are no aligned query bases.
        """
        start_offset: int = 0
        end_offset: int = 0
        element: CigarElement
        alignment_began = False
        for element in self.elements:
            if element.operator.is_clipping and not alignment_began:
                # We are in the clipping operators preceding the alignment
                # Note: hardclips have length-on-query=0
                start_offset += element.length_on_query
                end_offset += element.length_on_query
            elif not element.operator.is_clipping:
                # We are within the alignment
                alignment_began = True
                end_offset += element.length_on_query
            else:
                # We have exited the alignment and are in the clipping operators after the alignment
                break

        if start_offset == end_offset:
            raise ValueError(f"Cigar {self} has no aligned bases")
        return start_offset, end_offset

Functions

from_cigarstring classmethod
from_cigarstring(cigarstring: str) -> Cigar

Constructs a Cigar from a string returned by pysam.

If "*" is given, returns an empty Cigar.

Source code in fgpyo/sam/__init__.py
@classmethod
def from_cigarstring(cls, cigarstring: str) -> "Cigar":
    """Constructs a Cigar from a string returned by pysam.

    If "*" is given, returns an empty Cigar.
    """
    if cigarstring == "*":
        return Cigar()

    cigarstring_length = len(cigarstring)
    if cigarstring_length == 0:
        raise CigarParsingException("Cigar string was empty")

    elements = []
    i = 0
    while i < cigarstring_length:
        if not cigarstring[i].isdigit():
            raise cls._pretty_cigarstring_exception(cigarstring, i)
        length = int(cigarstring[i])
        i += 1
        while i < cigarstring_length and cigarstring[i].isdigit():
            length = (length * 10) + int(cigarstring[i])
            i += 1
        if i == cigarstring_length:
            raise cls._pretty_cigarstring_exception(cigarstring, i)
        try:
            operator = CigarOp.from_character(cigarstring[i])
            elements.append(CigarElement(length, operator))
        except KeyError as ex:
            # cigar operator was not valid
            raise cls._pretty_cigarstring_exception(cigarstring, i) from ex
        except IndexError as ex:
            # missing cigar operator (i == len(cigarstring))
            raise cls._pretty_cigarstring_exception(cigarstring, i) from ex
        i += 1
    return Cigar(tuple(elements))
from_cigartuples classmethod
from_cigartuples(cigartuples: Optional[List[Tuple[int, int]]]) -> Cigar

Returns a Cigar from a list of tuples returned by pysam.

Each tuple denotes the operation and length. See CigarOp() for more information on the various operators. If None is given, returns an empty Cigar.

Source code in fgpyo/sam/__init__.py
@classmethod
def from_cigartuples(cls, cigartuples: Optional[List[Tuple[int, int]]]) -> "Cigar":
    """Returns a Cigar from a list of tuples returned by pysam.

    Each tuple denotes the operation and length.  See
    [`CigarOp()`][fgpyo.sam.CigarOp] for more information on the
    various operators.  If None is given, returns an empty Cigar.
    """
    if cigartuples is None or cigartuples == []:
        return Cigar()
    try:
        elements = []
        for code, length in cigartuples:
            operator = CigarOp.from_code(code)
            elements.append(CigarElement(length, operator))
        return Cigar(tuple(elements))
    except Exception as ex:
        raise CigarParsingException(f"Malformed cigar tuples: {cigartuples}") from ex
length_on_query
length_on_query() -> int

Returns the length of the alignment on the query sequence.

Source code in fgpyo/sam/__init__.py
def length_on_query(self) -> int:
    """Returns the length of the alignment on the query sequence."""
    return sum([elem.length_on_query for elem in self.elements])
length_on_target
length_on_target() -> int

Returns the length of the alignment on the target sequence.

Source code in fgpyo/sam/__init__.py
def length_on_target(self) -> int:
    """Returns the length of the alignment on the target sequence."""
    return sum([elem.length_on_target for elem in self.elements])
query_alignment_offsets
query_alignment_offsets() -> Tuple[int, int]

Gets the 0-based, end-exclusive positions of the first and last aligned base in the query.

The resulting range will contain the range of positions in the SEQ string for the bases that are aligned. If counting from the end of the query is desired, use cigar.reversed().query_alignment_offsets()

Returns:

Type Description
Tuple[int, int]

A tuple (start, stop) containing the start and stop positions of the aligned part of the query. These offsets are 0-based and open-ended, with respect to the beginning of the query.

Raises:

Type Description
ValueError

If according to the cigar, there are no aligned query bases.

Source code in fgpyo/sam/__init__.py
def query_alignment_offsets(self) -> Tuple[int, int]:
    """
    Gets the 0-based, end-exclusive positions of the first and last aligned base in the query.

    The resulting range will contain the range of positions in the SEQ string for
    the bases that are aligned.
    If counting from the end of the query is desired, use
    `cigar.reversed().query_alignment_offsets()`

    Returns:
        A tuple (start, stop) containing the start and stop positions
            of the aligned part of the query. These offsets are 0-based and open-ended, with
            respect to the beginning of the query.

    Raises:
        ValueError: If according to the cigar, there are no aligned query bases.
    """
    start_offset: int = 0
    end_offset: int = 0
    element: CigarElement
    alignment_began = False
    for element in self.elements:
        if element.operator.is_clipping and not alignment_began:
            # We are in the clipping operators preceding the alignment
            # Note: hardclips have length-on-query=0
            start_offset += element.length_on_query
            end_offset += element.length_on_query
        elif not element.operator.is_clipping:
            # We are within the alignment
            alignment_began = True
            end_offset += element.length_on_query
        else:
            # We have exited the alignment and are in the clipping operators after the alignment
            break

    if start_offset == end_offset:
        raise ValueError(f"Cigar {self} has no aligned bases")
    return start_offset, end_offset
reversed
reversed() -> Cigar

Returns a copy of the Cigar with the elements in reverse order.

Source code in fgpyo/sam/__init__.py
def reversed(self) -> "Cigar":
    """Returns a copy of the Cigar with the elements in reverse order."""
    return Cigar(tuple(reversed(self.elements)))

CigarElement

Represents an element in a Cigar

Attributes:

Name Type Description
- length (int

the length of the element

- operator (CigarOp

the operator of the element

Source code in fgpyo/sam/__init__.py
@attr.s(frozen=True, slots=True, auto_attribs=True)
class CigarElement:
    """Represents an element in a Cigar

    Attributes:
        - length (int): the length of the element
        - operator (CigarOp): the operator of the element
    """

    length: int
    operator: CigarOp

    def __attrs_post_init__(self) -> None:
        """Validates the length attribute is greater than zero."""
        if self.length <= 0:
            raise ValueError(f"Cigar element must have a length > 0, found {self.length}")

    @property
    def length_on_query(self) -> int:
        """Returns the length of the element on the query sequence."""
        return self.length if self.operator.consumes_query else 0

    @property
    def length_on_target(self) -> int:
        """Returns the length of the element on the target (often reference) sequence."""
        return self.length if self.operator.consumes_reference else 0

    def __str__(self) -> str:
        return f"{self.length}{self.operator.character}"

Attributes

length_on_query property
length_on_query: int

Returns the length of the element on the query sequence.

length_on_target property
length_on_target: int

Returns the length of the element on the target (often reference) sequence.

Functions

__attrs_post_init__
__attrs_post_init__() -> None

Validates the length attribute is greater than zero.

Source code in fgpyo/sam/__init__.py
def __attrs_post_init__(self) -> None:
    """Validates the length attribute is greater than zero."""
    if self.length <= 0:
        raise ValueError(f"Cigar element must have a length > 0, found {self.length}")

CigarOp

Bases: Enum

Enumeration of operators that can appear in a Cigar string.

Attributes:

Name Type Description
code int

The ~pysam cigar operator code.

character int

The single character cigar operator.

consumes_query bool

True if this operator consumes query bases, False otherwise.

consumes_target bool

True if this operator consumes target bases, False otherwise.

Source code in fgpyo/sam/__init__.py
@enum.unique
class CigarOp(enum.Enum):
    """Enumeration of operators that can appear in a Cigar string.

    Attributes:
        code (int): The `~pysam` cigar operator code.
        character (int): The single character cigar operator.
        consumes_query (bool): True if this operator consumes query bases, False otherwise.
        consumes_target (bool): True if this operator consumes target bases, False otherwise.
    """

    M = (0, "M", True, True)  #: Match or Mismatch the reference
    I = (1, "I", True, False)  #: Insertion versus the reference  # noqa: E741
    D = (2, "D", False, True)  #: Deletion versus the reference
    N = (3, "N", False, True)  #: Skipped region from the reference
    S = (4, "S", True, False)  #: Soft clip
    H = (5, "H", False, False)  #: Hard clip
    P = (6, "P", False, False)  #: Padding
    EQ = (7, "=", True, True)  #: Matches the reference
    X = (8, "X", True, True)  #: Mismatches the reference

    def __init__(
        self, code: int, character: str, consumes_query: bool, consumes_reference: bool
    ) -> None:
        self.code = code
        self.character = character
        self.consumes_query = consumes_query
        self.consumes_reference = consumes_reference

    @staticmethod
    def from_character(character: str) -> "CigarOp":
        """Returns the operator from the single character."""
        if CigarOp.EQ.character == character:
            return CigarOp.EQ
        else:
            return CigarOp[character]

    @staticmethod
    def from_code(code: int) -> "CigarOp":
        """Returns the operator from the given operator code.

        Note: this is mainly used to get the operator from :py:mod:`~pysam`.
        """
        return CigarOp[_CigarOpUtil.CODE_TO_CHARACTER[code]]

    @property
    def is_indel(self) -> bool:
        """Returns true if the operator is an indel, false otherwise."""
        return self == CigarOp.I or self == CigarOp.D

    @property
    def is_clipping(self) -> bool:
        """Returns true if the operator is a soft/hard clip, false otherwise."""
        return self == CigarOp.S or self == CigarOp.H

Attributes

is_clipping property
is_clipping: bool

Returns true if the operator is a soft/hard clip, false otherwise.

is_indel property
is_indel: bool

Returns true if the operator is an indel, false otherwise.

Functions

from_character staticmethod
from_character(character: str) -> CigarOp

Returns the operator from the single character.

Source code in fgpyo/sam/__init__.py
@staticmethod
def from_character(character: str) -> "CigarOp":
    """Returns the operator from the single character."""
    if CigarOp.EQ.character == character:
        return CigarOp.EQ
    else:
        return CigarOp[character]
from_code staticmethod
from_code(code: int) -> CigarOp

Returns the operator from the given operator code.

Note: this is mainly used to get the operator from :py:mod:~pysam.

Source code in fgpyo/sam/__init__.py
@staticmethod
def from_code(code: int) -> "CigarOp":
    """Returns the operator from the given operator code.

    Note: this is mainly used to get the operator from :py:mod:`~pysam`.
    """
    return CigarOp[_CigarOpUtil.CODE_TO_CHARACTER[code]]

CigarParsingException

Bases: Exception

The exception raised specific to parsing a cigar.

Source code in fgpyo/sam/__init__.py
class CigarParsingException(Exception):
    """The exception raised specific to parsing a cigar."""

    pass

PairOrientation

Bases: Enum

Enumerations of read pair orientations.

Source code in fgpyo/sam/__init__.py
@enum.unique
class PairOrientation(enum.Enum):
    """Enumerations of read pair orientations."""

    FR = "FR"
    """A pair orientation for forward-reverse reads ("innie")."""

    RF = "RF"
    """A pair orientation for reverse-forward reads ("outie")."""

    TANDEM = "TANDEM"
    """A pair orientation for tandem (forward-forward or reverse-reverse) reads."""

    @classmethod
    def from_recs(  # noqa: C901  # `from_recs` is too complex (11 > 10)
        cls, rec1: AlignedSegment, rec2: Optional[AlignedSegment] = None
    ) -> Optional["PairOrientation"]:
        """Returns the pair orientation if both reads are mapped to the same reference sequence.

        Args:
            rec1: The first record in the pair.
            rec2: The second record in the pair. If None, then mate info on `rec1` will be used.

        See:
            [`htsjdk.samtools.SamPairUtil.getPairOrientation()`](https://github.com/samtools/htsjdk/blob/c31bc92c24bc4e9552b2a913e52286edf8f8ab96/src/main/java/htsjdk/samtools/SamPairUtil.java#L71-L102)
        """

        if rec2 is None:
            rec2_is_unmapped = rec1.mate_is_unmapped
            rec2_reference_id = rec1.next_reference_id
        else:
            rec2_is_unmapped = rec2.is_unmapped
            rec2_reference_id = rec2.reference_id

        if rec1.is_unmapped or rec2_is_unmapped or rec1.reference_id != rec2_reference_id:
            return None

        if rec2 is None:
            rec2_is_forward = rec1.mate_is_forward
            rec2_reference_start = rec1.next_reference_start
        else:
            rec2_is_forward = rec2.is_forward
            rec2_reference_start = rec2.reference_start

        if rec1.is_forward is rec2_is_forward:
            return PairOrientation.TANDEM
        if rec1.is_forward and rec1.reference_start <= rec2_reference_start:
            return PairOrientation.FR
        if rec1.is_reverse and rec2_reference_start < rec1.reference_end:
            return PairOrientation.FR
        if rec1.is_reverse and rec2_reference_start >= rec1.reference_end:
            return PairOrientation.RF

        if rec2 is None:
            if not rec1.has_tag("MC"):
                raise ValueError('Cannot determine pair orientation without a mate cigar ("MC")!')
            rec2_cigar = Cigar.from_cigarstring(str(rec1.get_tag("MC")))
            rec2_reference_end = rec1.next_reference_start + rec2_cigar.length_on_target()
        else:
            rec2_reference_end = rec2.reference_end

        if rec1.reference_start < rec2_reference_end:
            return PairOrientation.FR
        else:
            return PairOrientation.RF

Attributes

FR class-attribute instance-attribute
FR = 'FR'

A pair orientation for forward-reverse reads ("innie").

RF class-attribute instance-attribute
RF = 'RF'

A pair orientation for reverse-forward reads ("outie").

TANDEM class-attribute instance-attribute
TANDEM = 'TANDEM'

A pair orientation for tandem (forward-forward or reverse-reverse) reads.

Functions

from_recs classmethod
from_recs(rec1: AlignedSegment, rec2: Optional[AlignedSegment] = None) -> Optional[PairOrientation]

Returns the pair orientation if both reads are mapped to the same reference sequence.

Parameters:

Name Type Description Default
rec1 AlignedSegment

The first record in the pair.

required
rec2 Optional[AlignedSegment]

The second record in the pair. If None, then mate info on rec1 will be used.

None
See

htsjdk.samtools.SamPairUtil.getPairOrientation()

Source code in fgpyo/sam/__init__.py
@classmethod
def from_recs(  # noqa: C901  # `from_recs` is too complex (11 > 10)
    cls, rec1: AlignedSegment, rec2: Optional[AlignedSegment] = None
) -> Optional["PairOrientation"]:
    """Returns the pair orientation if both reads are mapped to the same reference sequence.

    Args:
        rec1: The first record in the pair.
        rec2: The second record in the pair. If None, then mate info on `rec1` will be used.

    See:
        [`htsjdk.samtools.SamPairUtil.getPairOrientation()`](https://github.com/samtools/htsjdk/blob/c31bc92c24bc4e9552b2a913e52286edf8f8ab96/src/main/java/htsjdk/samtools/SamPairUtil.java#L71-L102)
    """

    if rec2 is None:
        rec2_is_unmapped = rec1.mate_is_unmapped
        rec2_reference_id = rec1.next_reference_id
    else:
        rec2_is_unmapped = rec2.is_unmapped
        rec2_reference_id = rec2.reference_id

    if rec1.is_unmapped or rec2_is_unmapped or rec1.reference_id != rec2_reference_id:
        return None

    if rec2 is None:
        rec2_is_forward = rec1.mate_is_forward
        rec2_reference_start = rec1.next_reference_start
    else:
        rec2_is_forward = rec2.is_forward
        rec2_reference_start = rec2.reference_start

    if rec1.is_forward is rec2_is_forward:
        return PairOrientation.TANDEM
    if rec1.is_forward and rec1.reference_start <= rec2_reference_start:
        return PairOrientation.FR
    if rec1.is_reverse and rec2_reference_start < rec1.reference_end:
        return PairOrientation.FR
    if rec1.is_reverse and rec2_reference_start >= rec1.reference_end:
        return PairOrientation.RF

    if rec2 is None:
        if not rec1.has_tag("MC"):
            raise ValueError('Cannot determine pair orientation without a mate cigar ("MC")!')
        rec2_cigar = Cigar.from_cigarstring(str(rec1.get_tag("MC")))
        rec2_reference_end = rec1.next_reference_start + rec2_cigar.length_on_target()
    else:
        rec2_reference_end = rec2.reference_end

    if rec1.reference_start < rec2_reference_end:
        return PairOrientation.FR
    else:
        return PairOrientation.RF

ReadEditInfo

Counts various stats about how a read compares to a reference sequence.

Attributes:

Name Type Description
matches int

the number of bases in the read that match the reference

mismatches int

the number of mismatches between the read sequence and the reference sequence as dictated by the alignment. Like as defined for the SAM NM tag computation, any base except A/C/G/T in the read is considered a mismatch.

insertions int

the number of insertions in the read vs. the reference. I.e. the number of I operators in the CIGAR string.

inserted_bases int

the total number of bases contained within insertions in the read

deletions int

the number of deletions in the read vs. the reference. I.e. the number of D operators in the CIGAT string.

deleted_bases int

the total number of that are deleted within the alignment (i.e. bases in the reference but not in the read).

nm int

the computed value of the SAM NM tag, calculated as mismatches + inserted_bases + deleted_bases

Source code in fgpyo/sam/__init__.py
@attr.s(frozen=True, auto_attribs=True)
class ReadEditInfo:
    """
    Counts various stats about how a read compares to a reference sequence.

    Attributes:
        matches: the number of bases in the read that match the reference
        mismatches: the number of mismatches between the read sequence and the reference sequence
            as dictated by the alignment.  Like as defined for the SAM NM tag computation, any base
            except A/C/G/T in the read is considered a mismatch.
        insertions: the number of insertions in the read vs. the reference.  I.e. the number of I
            operators in the CIGAR string.
        inserted_bases: the total number of bases contained within insertions in the read
        deletions: the number of deletions in the read vs. the reference.  I.e. the number of D
            operators in the CIGAT string.
        deleted_bases: the total number of that are deleted within the alignment (i.e. bases in
            the reference but not in the read).
        nm: the computed value of the SAM NM tag, calculated as mismatches + inserted_bases +
            deleted_bases
    """

    matches: int
    mismatches: int
    insertions: int
    inserted_bases: int
    deletions: int
    deleted_bases: int
    nm: int

SamFileType

Bases: Enum

Enumeration of valid SAM/BAM/CRAM file types.

Attributes:

Name Type Description
mode str

The additional mode character to add when opening this file type.

ext str

The standard file extension for this file type.

Source code in fgpyo/sam/__init__.py
@enum.unique
class SamFileType(enum.Enum):
    """Enumeration of valid SAM/BAM/CRAM file types.

    Attributes:
        mode (str): The additional mode character to add when opening this file type.
        ext (str): The standard file extension for this file type.
    """

    def __init__(self, mode: str, ext: str) -> None:
        self.mode = mode
        self.extension = ext

    SAM = ("", ".sam")
    BAM = ("b", ".bam")
    CRAM = ("c", ".cram")

    @property
    def indexable(self) -> bool:
        """True if the file type can be indexed, false otherwise."""
        return self is SamFileType.BAM or self is SamFileType.CRAM

    @classmethod
    def from_path(cls, path: Union[Path, str]) -> "SamFileType":
        """Infers the file type based on the file extension.

        Args:
            path: the path to the SAM/BAM/CRAM to read or write.
        """
        ext = Path(path).suffix
        try:
            return next(iter([tpe for tpe in SamFileType if tpe.extension == ext]))
        except StopIteration as ex:
            raise ValueError(f"Could not infer file type from {path}") from ex

Attributes

indexable property
indexable: bool

True if the file type can be indexed, false otherwise.

Functions

from_path classmethod
from_path(path: Union[Path, str]) -> SamFileType

Infers the file type based on the file extension.

Parameters:

Name Type Description Default
path Union[Path, str]

the path to the SAM/BAM/CRAM to read or write.

required
Source code in fgpyo/sam/__init__.py
@classmethod
def from_path(cls, path: Union[Path, str]) -> "SamFileType":
    """Infers the file type based on the file extension.

    Args:
        path: the path to the SAM/BAM/CRAM to read or write.
    """
    ext = Path(path).suffix
    try:
        return next(iter([tpe for tpe in SamFileType if tpe.extension == ext]))
    except StopIteration as ex:
        raise ValueError(f"Could not infer file type from {path}") from ex

SamOrder

Bases: Enum

Enumerations of possible sort orders for a SAM file.

Source code in fgpyo/sam/__init__.py
class SamOrder(enum.Enum):
    """
    Enumerations of possible sort orders for a SAM file.
    """

    Unsorted = "unsorted"  #: the SAM / BAM / CRAM is unsorted
    Coordinate = "coordinate"  #: coordinate sorted
    QueryName = "queryname"  #: queryname sorted
    Unknown = "unknown"  # Unknown SAM / BAM / CRAM sort order

SupplementaryAlignment

Stores a supplementary alignment record produced by BWA and stored in the SA SAM tag.

Attributes:

Name Type Description
reference_name str

the name of the reference (i.e. contig, chromosome) aligned to

start int

the 0-based start position of the alignment

is_forward bool

true if the alignment is in the forward strand, false otherwise

cigar Cigar

the cigar for the alignment

mapq int

the mapping quality

nm int

the number of edits

Source code in fgpyo/sam/__init__.py
@attr.s(frozen=True, auto_attribs=True)
class SupplementaryAlignment:
    """Stores a supplementary alignment record produced by BWA and stored in the SA SAM tag.

    Attributes:
        reference_name: the name of the reference (i.e. contig, chromosome) aligned to
        start: the 0-based start position of the alignment
        is_forward: true if the alignment is in the forward strand, false otherwise
        cigar: the cigar for the alignment
        mapq: the mapping quality
        nm: the number of edits
    """

    reference_name: str
    start: int
    is_forward: bool
    cigar: Cigar
    mapq: int
    nm: int

    def __str__(self) -> str:
        return ",".join(
            str(item)
            for item in (
                self.reference_name,
                self.start + 1,
                "+" if self.is_forward else "-",
                self.cigar,
                self.mapq,
                self.nm,
            )
        )

    @property
    def end(self) -> int:
        """The 0-based exclusive end position of the alignment."""
        return self.start + self.cigar.length_on_target()

    @staticmethod
    def parse(string: str) -> "SupplementaryAlignment":
        """Returns a supplementary alignment parsed from the given string.  The various fields
        should be comma-delimited (ex. `chr1,123,-,100M50S,60,4`)
        """
        fields = string.split(",")
        return SupplementaryAlignment(
            reference_name=fields[0],
            start=int(fields[1]) - 1,
            is_forward=fields[2] == "+",
            cigar=Cigar.from_cigarstring(fields[3]),
            mapq=int(fields[4]),
            nm=int(fields[5]),
        )

    @staticmethod
    def parse_sa_tag(tag: str) -> List["SupplementaryAlignment"]:
        """Parses an SA tag of supplementary alignments from a BAM file. If the tag is empty
        or contains just a single semi-colon then an empty list will be returned.  Otherwise
        a list containing a SupplementaryAlignment per ;-separated value in the tag will
        be returned.
        """
        return [SupplementaryAlignment.parse(a) for a in tag.split(";") if len(a) > 0]

    @classmethod
    def from_read(cls, read: pysam.AlignedSegment) -> List["SupplementaryAlignment"]:
        """
        Construct a list of SupplementaryAlignments from the SA tag in a pysam.AlignedSegment.

        Args:
            read: An alignment. The presence of the "SA" tag is not required.

        Returns:
            A list of all SupplementaryAlignments present in the SA tag.
            If the SA tag is not present, or it is empty, an empty list will be returned.
        """
        if read.has_tag("SA"):
            sa_tag: str = cast(str, read.get_tag("SA"))
            return cls.parse_sa_tag(sa_tag)
        else:
            return []

Attributes

end property
end: int

The 0-based exclusive end position of the alignment.

Functions

from_read classmethod
from_read(read: AlignedSegment) -> List[SupplementaryAlignment]

Construct a list of SupplementaryAlignments from the SA tag in a pysam.AlignedSegment.

Parameters:

Name Type Description Default
read AlignedSegment

An alignment. The presence of the "SA" tag is not required.

required

Returns:

Type Description
List[SupplementaryAlignment]

A list of all SupplementaryAlignments present in the SA tag.

List[SupplementaryAlignment]

If the SA tag is not present, or it is empty, an empty list will be returned.

Source code in fgpyo/sam/__init__.py
@classmethod
def from_read(cls, read: pysam.AlignedSegment) -> List["SupplementaryAlignment"]:
    """
    Construct a list of SupplementaryAlignments from the SA tag in a pysam.AlignedSegment.

    Args:
        read: An alignment. The presence of the "SA" tag is not required.

    Returns:
        A list of all SupplementaryAlignments present in the SA tag.
        If the SA tag is not present, or it is empty, an empty list will be returned.
    """
    if read.has_tag("SA"):
        sa_tag: str = cast(str, read.get_tag("SA"))
        return cls.parse_sa_tag(sa_tag)
    else:
        return []
parse staticmethod
parse(string: str) -> SupplementaryAlignment

Returns a supplementary alignment parsed from the given string. The various fields should be comma-delimited (ex. chr1,123,-,100M50S,60,4)

Source code in fgpyo/sam/__init__.py
@staticmethod
def parse(string: str) -> "SupplementaryAlignment":
    """Returns a supplementary alignment parsed from the given string.  The various fields
    should be comma-delimited (ex. `chr1,123,-,100M50S,60,4`)
    """
    fields = string.split(",")
    return SupplementaryAlignment(
        reference_name=fields[0],
        start=int(fields[1]) - 1,
        is_forward=fields[2] == "+",
        cigar=Cigar.from_cigarstring(fields[3]),
        mapq=int(fields[4]),
        nm=int(fields[5]),
    )
parse_sa_tag staticmethod
parse_sa_tag(tag: str) -> List[SupplementaryAlignment]

Parses an SA tag of supplementary alignments from a BAM file. If the tag is empty or contains just a single semi-colon then an empty list will be returned. Otherwise a list containing a SupplementaryAlignment per ;-separated value in the tag will be returned.

Source code in fgpyo/sam/__init__.py
@staticmethod
def parse_sa_tag(tag: str) -> List["SupplementaryAlignment"]:
    """Parses an SA tag of supplementary alignments from a BAM file. If the tag is empty
    or contains just a single semi-colon then an empty list will be returned.  Otherwise
    a list containing a SupplementaryAlignment per ;-separated value in the tag will
    be returned.
    """
    return [SupplementaryAlignment.parse(a) for a in tag.split(";") if len(a) > 0]

Template

A container for alignment records corresponding to a single sequenced template or insert.

It is strongly preferred that new Template instances be created with Template.build() which will ensure that reads are stored in the correct Template property, and run basic validations of the Template by default. If constructing Template instances by construction users are encouraged to use the validate method post-construction.

In the special cases there are alignments records that are both secondary and supplementary then they will be stored upon the r1_supplementals and r2_supplementals fields only.

Attributes:

Name Type Description
name str

the name of the template/query

r1 Optional[AlignedSegment]

Primary non-supplementary alignment for read 1, or None if there is none

r2 Optional[AlignedSegment]

Primary non-supplementary alignment for read 2, or None if there is none

r1_supplementals List[AlignedSegment]

Supplementary alignments for read 1

r2_supplementals List[AlignedSegment]

Supplementary alignments for read 2

r1_secondaries List[AlignedSegment]

Secondary (non-primary, non-supplementary) alignments for read 1

r2_secondaries List[AlignedSegment]

Secondary (non-primary, non-supplementary) alignments for read 2

Source code in fgpyo/sam/__init__.py
@attr.s(frozen=True, auto_attribs=True)
class Template:
    """A container for alignment records corresponding to a single sequenced template
    or insert.

    It is strongly preferred that new Template instances be created with `Template.build()`
    which will ensure that reads are stored in the correct Template property, and run basic
    validations of the Template by default.  If constructing Template instances by construction
    users are encouraged to use the validate method post-construction.

    In the special cases there are alignments records that are _*both secondary and supplementary*_
    then they will be stored upon the `r1_supplementals` and `r2_supplementals` fields only.

    Attributes:
        name: the name of the template/query
        r1: Primary non-supplementary alignment for read 1, or None if there is none
        r2: Primary non-supplementary alignment for read 2, or None if there is none
        r1_supplementals: Supplementary alignments for read 1
        r2_supplementals: Supplementary alignments for read 2
        r1_secondaries: Secondary (non-primary, non-supplementary) alignments for read 1
        r2_secondaries: Secondary (non-primary, non-supplementary) alignments for read 2
    """

    name: str
    r1: Optional[AlignedSegment]
    r2: Optional[AlignedSegment]
    r1_supplementals: List[AlignedSegment]
    r2_supplementals: List[AlignedSegment]
    r1_secondaries: List[AlignedSegment]
    r2_secondaries: List[AlignedSegment]

    @staticmethod
    def iterator(alns: Iterator[AlignedSegment]) -> Iterator["Template"]:
        """Returns an iterator over templates. Assumes the input iterable is queryname grouped,
        and gathers consecutive runs of records sharing a common query name into templates."""
        return TemplateIterator(alns)

    @staticmethod
    def build(recs: Iterable[AlignedSegment], validate: bool = True) -> "Template":
        """Build a template from a set of records all with the same queryname."""
        name = None
        r1 = None
        r2 = None
        r1_supplementals: List[AlignedSegment] = []
        r2_supplementals: List[AlignedSegment] = []
        r1_secondaries: List[AlignedSegment] = []
        r2_secondaries: List[AlignedSegment] = []

        for rec in recs:
            if name is None:
                name = rec.query_name

            is_r1 = not rec.is_paired or rec.is_read1

            if not rec.is_supplementary and not rec.is_secondary:
                if is_r1:
                    assert r1 is None, f"Multiple R1 primary reads found in {recs}"
                    r1 = rec
                else:
                    assert r2 is None, f"Multiple R2 primary reads found in {recs}"
                    r2 = rec
            elif rec.is_supplementary:
                if is_r1:
                    r1_supplementals.append(rec)
                else:
                    r2_supplementals.append(rec)
            elif rec.is_secondary:
                if is_r1:
                    r1_secondaries.append(rec)
                else:
                    r2_secondaries.append(rec)

        assert name is not None, "Cannot construct a template from zero records."

        template = Template(
            name=name,
            r1=r1,
            r2=r2,
            r1_supplementals=r1_supplementals,
            r2_supplementals=r2_supplementals,
            r1_secondaries=r1_secondaries,
            r2_secondaries=r2_secondaries,
        )

        if validate:
            template.validate()

        return template

    def validate(self) -> None:
        """Performs sanity checks that all the records in the Template are as expected."""
        for rec in self.all_recs():
            assert rec.query_name == self.name, f"Name error {self.name} vs. {rec.query_name}"

        if self.r1 is not None:
            assert self.r1.is_read1 or not self.r1.is_paired, "R1 not flagged as R1 or unpaired"
            assert not self.r1.is_supplementary, "R1 primary flagged as supplementary"
            assert not self.r1.is_secondary, "R1 primary flagged as secondary"

        if self.r2 is not None:
            assert self.r2.is_read2, "R2 not flagged as R2"
            assert not self.r2.is_supplementary, "R2 primary flagged as supplementary"
            assert not self.r2.is_secondary, "R2 primary flagged as secondary"

        for rec in self.r1_secondaries:
            assert rec.is_read1 or not rec.is_paired, "R1 secondary not flagged as R1 or unpaired"
            assert rec.is_secondary, "R1 secondary not flagged as secondary"
            assert not rec.is_supplementary, "R1 secondary supplementals belong with supplementals"

        for rec in self.r1_supplementals:
            assert rec.is_read1 or not rec.is_paired, "R1 supp. not flagged as R1 or unpaired"
            assert rec.is_supplementary, "R1 supp. not flagged as supplementary"

        for rec in self.r2_secondaries:
            assert rec.is_read2, "R2 secondary not flagged as R2"
            assert rec.is_secondary, "R2 secondary not flagged as secondary"
            assert not rec.is_supplementary, "R2 secondary supplementals belong with supplementals"

        for rec in self.r2_supplementals:
            assert rec.is_read2, "R2 supp. not flagged as R2"
            assert rec.is_supplementary, "R2 supp. not flagged as supplementary"

    def primary_recs(self) -> Iterator[AlignedSegment]:
        """Returns a list with all the primary records for the template."""
        return (r for r in (self.r1, self.r2) if r is not None)

    def all_r1s(self) -> Iterator[AlignedSegment]:
        """Yields all R1 alignments of this template including secondary and supplementary."""
        r1_primary = [] if self.r1 is None else [self.r1]
        return chain(r1_primary, self.r1_secondaries, self.r1_supplementals)

    def all_r2s(self) -> Iterator[AlignedSegment]:
        """Yields all R2 alignments of this template including secondary and supplementary."""
        r2_primary = [] if self.r2 is None else [self.r2]
        return chain(r2_primary, self.r2_secondaries, self.r2_supplementals)

    def all_recs(self) -> Iterator[AlignedSegment]:
        """Returns a list with all the records for the template."""
        for rec in self.primary_recs():
            yield rec

        for recs in (
            self.r1_supplementals,
            self.r1_secondaries,
            self.r2_supplementals,
            self.r2_secondaries,
        ):
            for rec in recs:
                yield rec

    def set_mate_info(
        self,
        is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair,
        isize: Callable[[AlignedSegment, AlignedSegment], int] = isize,
    ) -> Self:
        """Reset all mate information on every alignment in the template.

        Args:
            is_proper_pair: A function that takes two alignments and determines proper pair status.
            isize: A function that takes the two alignments and calculates their isize.
        """
        if self.r1 is not None and self.r2 is not None:
            set_mate_info(self.r1, self.r2, is_proper_pair=is_proper_pair, isize=isize)
        if self.r1 is not None:
            for rec in self.r2_secondaries:
                set_mate_info_on_secondary(secondary=rec, mate_primary=self.r1)
            for rec in self.r2_supplementals:
                set_mate_info_on_supplementary(supp=rec, mate_primary=self.r1)
        if self.r2 is not None:
            for rec in self.r1_secondaries:
                set_mate_info_on_secondary(secondary=rec, mate_primary=self.r2)
            for rec in self.r1_supplementals:
                set_mate_info_on_supplementary(supp=rec, mate_primary=self.r2)
        return self

    def write_to(
        self,
        writer: SamFile,
        primary_only: bool = False,
    ) -> None:
        """Write the records associated with the template to file.

        Args:
            writer: An open, writable AlignmentFile.
            primary_only: If True, only write primary alignments.
        """

        if primary_only:
            rec_iter = self.primary_recs()
        else:
            rec_iter = self.all_recs()

        for rec in rec_iter:
            writer.write(rec)

    def set_tag(
        self,
        tag: str,
        value: Union[str, int, float, None],
    ) -> None:
        """Add a tag to all records associated with the template.

        Setting a tag to `None` will remove the tag.

        Args:
            tag: The name of the tag.
            value: The value of the tag.
        """

        assert len(tag) == 2, f"Tags must be 2 characters: {tag}."

        for rec in self.all_recs():
            rec.set_tag(tag, value)

Functions

all_r1s
all_r1s() -> Iterator[AlignedSegment]

Yields all R1 alignments of this template including secondary and supplementary.

Source code in fgpyo/sam/__init__.py
def all_r1s(self) -> Iterator[AlignedSegment]:
    """Yields all R1 alignments of this template including secondary and supplementary."""
    r1_primary = [] if self.r1 is None else [self.r1]
    return chain(r1_primary, self.r1_secondaries, self.r1_supplementals)
all_r2s
all_r2s() -> Iterator[AlignedSegment]

Yields all R2 alignments of this template including secondary and supplementary.

Source code in fgpyo/sam/__init__.py
def all_r2s(self) -> Iterator[AlignedSegment]:
    """Yields all R2 alignments of this template including secondary and supplementary."""
    r2_primary = [] if self.r2 is None else [self.r2]
    return chain(r2_primary, self.r2_secondaries, self.r2_supplementals)
all_recs
all_recs() -> Iterator[AlignedSegment]

Returns a list with all the records for the template.

Source code in fgpyo/sam/__init__.py
def all_recs(self) -> Iterator[AlignedSegment]:
    """Returns a list with all the records for the template."""
    for rec in self.primary_recs():
        yield rec

    for recs in (
        self.r1_supplementals,
        self.r1_secondaries,
        self.r2_supplementals,
        self.r2_secondaries,
    ):
        for rec in recs:
            yield rec
build staticmethod
build(recs: Iterable[AlignedSegment], validate: bool = True) -> Template

Build a template from a set of records all with the same queryname.

Source code in fgpyo/sam/__init__.py
@staticmethod
def build(recs: Iterable[AlignedSegment], validate: bool = True) -> "Template":
    """Build a template from a set of records all with the same queryname."""
    name = None
    r1 = None
    r2 = None
    r1_supplementals: List[AlignedSegment] = []
    r2_supplementals: List[AlignedSegment] = []
    r1_secondaries: List[AlignedSegment] = []
    r2_secondaries: List[AlignedSegment] = []

    for rec in recs:
        if name is None:
            name = rec.query_name

        is_r1 = not rec.is_paired or rec.is_read1

        if not rec.is_supplementary and not rec.is_secondary:
            if is_r1:
                assert r1 is None, f"Multiple R1 primary reads found in {recs}"
                r1 = rec
            else:
                assert r2 is None, f"Multiple R2 primary reads found in {recs}"
                r2 = rec
        elif rec.is_supplementary:
            if is_r1:
                r1_supplementals.append(rec)
            else:
                r2_supplementals.append(rec)
        elif rec.is_secondary:
            if is_r1:
                r1_secondaries.append(rec)
            else:
                r2_secondaries.append(rec)

    assert name is not None, "Cannot construct a template from zero records."

    template = Template(
        name=name,
        r1=r1,
        r2=r2,
        r1_supplementals=r1_supplementals,
        r2_supplementals=r2_supplementals,
        r1_secondaries=r1_secondaries,
        r2_secondaries=r2_secondaries,
    )

    if validate:
        template.validate()

    return template
iterator staticmethod
iterator(alns: Iterator[AlignedSegment]) -> Iterator[Template]

Returns an iterator over templates. Assumes the input iterable is queryname grouped, and gathers consecutive runs of records sharing a common query name into templates.

Source code in fgpyo/sam/__init__.py
@staticmethod
def iterator(alns: Iterator[AlignedSegment]) -> Iterator["Template"]:
    """Returns an iterator over templates. Assumes the input iterable is queryname grouped,
    and gathers consecutive runs of records sharing a common query name into templates."""
    return TemplateIterator(alns)
primary_recs
primary_recs() -> Iterator[AlignedSegment]

Returns a list with all the primary records for the template.

Source code in fgpyo/sam/__init__.py
def primary_recs(self) -> Iterator[AlignedSegment]:
    """Returns a list with all the primary records for the template."""
    return (r for r in (self.r1, self.r2) if r is not None)
set_mate_info
set_mate_info(is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair, isize: Callable[[AlignedSegment, AlignedSegment], int] = isize) -> Self

Reset all mate information on every alignment in the template.

Parameters:

Name Type Description Default
is_proper_pair Callable[[AlignedSegment, AlignedSegment], bool]

A function that takes two alignments and determines proper pair status.

is_proper_pair
isize Callable[[AlignedSegment, AlignedSegment], int]

A function that takes the two alignments and calculates their isize.

isize
Source code in fgpyo/sam/__init__.py
def set_mate_info(
    self,
    is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair,
    isize: Callable[[AlignedSegment, AlignedSegment], int] = isize,
) -> Self:
    """Reset all mate information on every alignment in the template.

    Args:
        is_proper_pair: A function that takes two alignments and determines proper pair status.
        isize: A function that takes the two alignments and calculates their isize.
    """
    if self.r1 is not None and self.r2 is not None:
        set_mate_info(self.r1, self.r2, is_proper_pair=is_proper_pair, isize=isize)
    if self.r1 is not None:
        for rec in self.r2_secondaries:
            set_mate_info_on_secondary(secondary=rec, mate_primary=self.r1)
        for rec in self.r2_supplementals:
            set_mate_info_on_supplementary(supp=rec, mate_primary=self.r1)
    if self.r2 is not None:
        for rec in self.r1_secondaries:
            set_mate_info_on_secondary(secondary=rec, mate_primary=self.r2)
        for rec in self.r1_supplementals:
            set_mate_info_on_supplementary(supp=rec, mate_primary=self.r2)
    return self
set_tag
set_tag(tag: str, value: Union[str, int, float, None]) -> None

Add a tag to all records associated with the template.

Setting a tag to None will remove the tag.

Parameters:

Name Type Description Default
tag str

The name of the tag.

required
value Union[str, int, float, None]

The value of the tag.

required
Source code in fgpyo/sam/__init__.py
def set_tag(
    self,
    tag: str,
    value: Union[str, int, float, None],
) -> None:
    """Add a tag to all records associated with the template.

    Setting a tag to `None` will remove the tag.

    Args:
        tag: The name of the tag.
        value: The value of the tag.
    """

    assert len(tag) == 2, f"Tags must be 2 characters: {tag}."

    for rec in self.all_recs():
        rec.set_tag(tag, value)
validate
validate() -> None

Performs sanity checks that all the records in the Template are as expected.

Source code in fgpyo/sam/__init__.py
def validate(self) -> None:
    """Performs sanity checks that all the records in the Template are as expected."""
    for rec in self.all_recs():
        assert rec.query_name == self.name, f"Name error {self.name} vs. {rec.query_name}"

    if self.r1 is not None:
        assert self.r1.is_read1 or not self.r1.is_paired, "R1 not flagged as R1 or unpaired"
        assert not self.r1.is_supplementary, "R1 primary flagged as supplementary"
        assert not self.r1.is_secondary, "R1 primary flagged as secondary"

    if self.r2 is not None:
        assert self.r2.is_read2, "R2 not flagged as R2"
        assert not self.r2.is_supplementary, "R2 primary flagged as supplementary"
        assert not self.r2.is_secondary, "R2 primary flagged as secondary"

    for rec in self.r1_secondaries:
        assert rec.is_read1 or not rec.is_paired, "R1 secondary not flagged as R1 or unpaired"
        assert rec.is_secondary, "R1 secondary not flagged as secondary"
        assert not rec.is_supplementary, "R1 secondary supplementals belong with supplementals"

    for rec in self.r1_supplementals:
        assert rec.is_read1 or not rec.is_paired, "R1 supp. not flagged as R1 or unpaired"
        assert rec.is_supplementary, "R1 supp. not flagged as supplementary"

    for rec in self.r2_secondaries:
        assert rec.is_read2, "R2 secondary not flagged as R2"
        assert rec.is_secondary, "R2 secondary not flagged as secondary"
        assert not rec.is_supplementary, "R2 secondary supplementals belong with supplementals"

    for rec in self.r2_supplementals:
        assert rec.is_read2, "R2 supp. not flagged as R2"
        assert rec.is_supplementary, "R2 supp. not flagged as supplementary"
write_to
write_to(writer: AlignmentFile, primary_only: bool = False) -> None

Write the records associated with the template to file.

Parameters:

Name Type Description Default
writer AlignmentFile

An open, writable AlignmentFile.

required
primary_only bool

If True, only write primary alignments.

False
Source code in fgpyo/sam/__init__.py
def write_to(
    self,
    writer: SamFile,
    primary_only: bool = False,
) -> None:
    """Write the records associated with the template to file.

    Args:
        writer: An open, writable AlignmentFile.
        primary_only: If True, only write primary alignments.
    """

    if primary_only:
        rec_iter = self.primary_recs()
    else:
        rec_iter = self.all_recs()

    for rec in rec_iter:
        writer.write(rec)

TemplateIterator

Bases: Iterator[Template]

An iterator that converts an iterator over query-grouped reads into an iterator over templates.

Source code in fgpyo/sam/__init__.py
class TemplateIterator(Iterator[Template]):
    """
    An iterator that converts an iterator over query-grouped reads into an iterator
    over templates.
    """

    def __init__(self, iterator: Iterator[AlignedSegment]) -> None:
        self._iter = PeekableIterator(iterator)

    def __iter__(self) -> Iterator[Template]:
        return self

    def __next__(self) -> Template:
        name = self._iter.peek().query_name
        recs = self._iter.takewhile(lambda r: r.query_name == name)
        return Template.build(recs, validate=False)

Functions

calculate_edit_info

calculate_edit_info(rec: AlignedSegment, reference_sequence: str, reference_offset: Optional[int] = None) -> ReadEditInfo

Constructs a ReadEditInfo instance giving summary stats about how the read aligns to the reference. Computes the number of mismatches, indels, indel bases and the SAM NM tag. The read must be aligned.

Parameters:

Name Type Description Default
rec AlignedSegment

the read/record for which to calculate values

required
reference_sequence str

the reference sequence (or fragment thereof) that the read is aligned to

required
reference_offset Optional[int]

if provided, assume that reference_sequence[reference_offset] is the first base aligned to in reference_sequence, otherwise use r.reference_start

None

Returns:

Type Description
ReadEditInfo

a ReadEditInfo with information about how the read differs from the reference

Source code in fgpyo/sam/__init__.py
def calculate_edit_info(
    rec: AlignedSegment, reference_sequence: str, reference_offset: Optional[int] = None
) -> ReadEditInfo:
    """
    Constructs a `ReadEditInfo` instance giving summary stats about how the read aligns to the
    reference.  Computes the number of mismatches, indels, indel bases and the SAM NM tag.
    The read must be aligned.

    Args:
        rec: the read/record for which to calculate values
        reference_sequence: the reference sequence (or fragment thereof) that the read is
            aligned to
        reference_offset: if provided, assume that reference_sequence[reference_offset] is the
            first base aligned to in reference_sequence, otherwise use r.reference_start

    Returns:
        a ReadEditInfo with information about how the read differs from the reference
    """
    assert not rec.is_unmapped, f"Cannot calculate edit info for unmapped read: {rec}"

    query_offset = 0
    target_offset = reference_offset if reference_offset is not None else rec.reference_start
    cigar = Cigar.from_cigartuples(rec.cigartuples)

    matches, mms, insertions, ins_bases, deletions, del_bases = 0, 0, 0, 0, 0, 0
    ok_bases = {"A", "C", "G", "T"}

    for elem in cigar.elements:
        op = elem.operator

        if op == CigarOp.I:
            insertions += 1
            ins_bases += elem.length
        elif op == CigarOp.D:
            deletions += 1
            del_bases += elem.length
        elif op == CigarOp.M or op == CigarOp.X or op == CigarOp.EQ:
            for i in range(0, elem.length):
                q = rec.query_sequence[query_offset + i].upper()
                t = reference_sequence[target_offset + i].upper()
                if q != t or q not in ok_bases:
                    mms += 1
                else:
                    matches += 1

        query_offset += elem.length_on_query
        target_offset += elem.length_on_target

    return ReadEditInfo(
        matches=matches,
        mismatches=mms,
        insertions=insertions,
        inserted_bases=ins_bases,
        deletions=deletions,
        deleted_bases=del_bases,
        nm=mms + ins_bases + del_bases,
    )

is_proper_pair

is_proper_pair(rec1: AlignedSegment, rec2: Optional[AlignedSegment] = None, max_insert_size: int = 1000, orientations: Collection[PairOrientation] = DefaultProperlyPairedOrientations, isize: Callable[[AlignedSegment, AlignedSegment], int] = isize) -> bool

Determines if a pair of records are properly paired or not.

Criteria for records in a proper pair are
  • Both records are aligned
  • Both records are aligned to the same reference sequence
  • The pair orientation of the records is one of the valid pair orientations (default "FR")
  • The inferred insert size is not more than a maximum length (default 1000)

Parameters:

Name Type Description Default
rec1 AlignedSegment

The first record in the pair.

required
rec2 Optional[AlignedSegment]

The second record in the pair. If None, then mate info on rec1 will be used.

None
max_insert_size int

The maximum insert size to consider a pair "proper".

1000
orientations Collection[PairOrientation]

The valid set of orientations to consider a pair "proper".

DefaultProperlyPairedOrientations
isize Callable[[AlignedSegment, AlignedSegment], int]

A function that takes the two alignments and calculates their isize.

isize
See

htsjdk.samtools.SamPairUtil.isProperPair()

Source code in fgpyo/sam/__init__.py
def is_proper_pair(
    rec1: AlignedSegment,
    rec2: Optional[AlignedSegment] = None,
    max_insert_size: int = 1000,
    orientations: Collection[PairOrientation] = DefaultProperlyPairedOrientations,
    isize: Callable[[AlignedSegment, AlignedSegment], int] = isize,
) -> bool:
    """Determines if a pair of records are properly paired or not.

    Criteria for records in a proper pair are:
        - Both records are aligned
        - Both records are aligned to the same reference sequence
        - The pair orientation of the records is one of the valid pair orientations (default "FR")
        - The inferred insert size is not more than a maximum length (default 1000)

    Args:
        rec1: The first record in the pair.
        rec2: The second record in the pair. If None, then mate info on `rec1` will be used.
        max_insert_size: The maximum insert size to consider a pair "proper".
        orientations: The valid set of orientations to consider a pair "proper".
        isize: A function that takes the two alignments and calculates their isize.

    See:
        [`htsjdk.samtools.SamPairUtil.isProperPair()`](https://github.com/samtools/htsjdk/blob/c31bc92c24bc4e9552b2a913e52286edf8f8ab96/src/main/java/htsjdk/samtools/SamPairUtil.java#L106-L125)
    """
    if rec2 is None:
        rec2_is_mapped = rec1.mate_is_mapped
        rec2_reference_id = rec1.next_reference_id
    else:
        rec2_is_mapped = rec2.is_mapped
        rec2_reference_id = rec2.reference_id

    return (
        rec1.is_mapped
        and rec2_is_mapped
        and rec1.reference_id == rec2_reference_id
        and PairOrientation.from_recs(rec1=rec1, rec2=rec2) in orientations
        and 0 < abs(isize(rec1, rec2)) <= max_insert_size
    )

isize

isize(rec1: AlignedSegment, rec2: Optional[AlignedSegment] = None) -> int

Computes the insert size ("template length" or "TLEN") for a pair of records.

Parameters:

Name Type Description Default
rec1 AlignedSegment

The first record in the pair.

required
rec2 Optional[AlignedSegment]

The second record in the pair. If None, then mate info on rec1 will be used.

None
Source code in fgpyo/sam/__init__.py
def isize(rec1: AlignedSegment, rec2: Optional[AlignedSegment] = None) -> int:
    """Computes the insert size ("template length" or "TLEN") for a pair of records.

    Args:
        rec1: The first record in the pair.
        rec2: The second record in the pair. If None, then mate info on `rec1` will be used.
    """
    if rec2 is None:
        rec2_is_unmapped = rec1.mate_is_unmapped
        rec2_reference_id = rec1.next_reference_id
    else:
        rec2_is_unmapped = rec2.is_unmapped
        rec2_reference_id = rec2.reference_id

    if rec1.is_unmapped or rec2_is_unmapped or rec1.reference_id != rec2_reference_id:
        return 0

    if rec2 is None:
        rec2_is_forward = rec1.mate_is_forward
        rec2_reference_start = rec1.next_reference_start
    else:
        rec2_is_forward = rec2.is_forward
        rec2_reference_start = rec2.reference_start

    if rec1.is_forward and rec2_is_forward:
        return rec2_reference_start - rec1.reference_start
    if rec1.is_reverse and rec2_is_forward:
        return rec2_reference_start - rec1.reference_end

    if rec2 is None:
        if not rec1.has_tag("MC"):
            raise ValueError('Cannot determine proper pair status without a mate cigar ("MC")!')
        rec2_cigar = Cigar.from_cigarstring(str(rec1.get_tag("MC")))
        rec2_reference_end = rec1.next_reference_start + rec2_cigar.length_on_target()
    else:
        rec2_reference_end = rec2.reference_end

    if rec1.is_forward:
        return rec2_reference_end - rec1.reference_start
    else:
        return rec2_reference_end - rec1.reference_end

reader

reader(path: SamPath, file_type: Optional[SamFileType] = None, unmapped: bool = False) -> AlignmentFile

Opens a SAM/BAM/CRAM for reading.

To read from standard input, provide any of "-", "stdin", or "/dev/stdin" as the input path.

Parameters:

Name Type Description Default
path SamPath

a file handle or path to the SAM/BAM/CRAM to read or write.

required
file_type Optional[SamFileType]

the file type to assume when opening the file. If None, then the file type will be auto-detected.

None
unmapped bool

True if the file is unmapped and has no sequence dictionary, False otherwise.

False
Source code in fgpyo/sam/__init__.py
def reader(
    path: SamPath, file_type: Optional[SamFileType] = None, unmapped: bool = False
) -> SamFile:
    """Opens a SAM/BAM/CRAM for reading.

    To read from standard input, provide any of `"-"`, `"stdin"`, or `"/dev/stdin"` as the input
    `path`.

    Args:
        path: a file handle or path to the SAM/BAM/CRAM to read or write.
        file_type: the file type to assume when opening the file.  If None, then the file
            type will be auto-detected.
        unmapped: True if the file is unmapped and has no sequence dictionary, False otherwise.
    """
    return _pysam_open(path=path, open_for_reading=True, file_type=file_type, unmapped=unmapped)

set_mate_info

set_mate_info(rec1: AlignedSegment, rec2: AlignedSegment, is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair, isize: Callable[[AlignedSegment, AlignedSegment], int] = isize) -> None

Resets mate pair information between two primary alignments that share a query name.

Parameters:

Name Type Description Default
rec1 AlignedSegment

The first record in the pair.

required
rec2 AlignedSegment

The second record in the pair.

required
is_proper_pair Callable[[AlignedSegment, AlignedSegment], bool]

A function that takes the two alignments and determines proper pair status.

is_proper_pair
isize Callable[[AlignedSegment, AlignedSegment], int]

A function that takes the two alignments and calculates their isize.

isize

Raises:

Type Description
ValueError

If rec1 and rec2 are of the same read ordinal.

ValueError

If either rec1 or rec2 is secondary or supplementary.

ValueError

If rec1 and rec2 do not share the same query name.

Source code in fgpyo/sam/__init__.py
def set_mate_info(
    rec1: AlignedSegment,
    rec2: AlignedSegment,
    is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair,
    isize: Callable[[AlignedSegment, AlignedSegment], int] = isize,
) -> None:
    """Resets mate pair information between two primary alignments that share a query name.

    Args:
        rec1: The first record in the pair.
        rec2: The second record in the pair.
        is_proper_pair: A function that takes the two alignments and determines proper pair status.
        isize: A function that takes the two alignments and calculates their isize.

    Raises:
        ValueError: If rec1 and rec2 are of the same read ordinal.
        ValueError: If either rec1 or rec2 is secondary or supplementary.
        ValueError: If rec1 and rec2 do not share the same query name.
    """
    for dest, source in [(rec1, rec2), (rec2, rec1)]:
        _set_common_mate_fields(dest=dest, mate_primary=source)

    template_length = isize(rec1, rec2)
    rec1.template_length = template_length
    rec2.template_length = -template_length

    proper_pair = is_proper_pair(rec1, rec2)
    rec1.is_proper_pair = proper_pair
    rec2.is_proper_pair = proper_pair

set_mate_info_on_secondary

set_mate_info_on_secondary(secondary: AlignedSegment, mate_primary: AlignedSegment) -> None

Set mate info on a secondary alignment from its mate's primary alignment.

Parameters:

Name Type Description Default
secondary AlignedSegment

The secondary alignment to set mate information upon.

required
mate_primary AlignedSegment

The primary alignment of the secondary's mate.

required

Raises:

Type Description
ValueError

If secondary and mate_primary are of the same read ordinal.

ValueError

If secondary and mate_primary do not share the same query name.

ValueError

If mate_primary is secondary or supplementary.

ValueError

If secondary is not marked as a secondary alignment.

Source code in fgpyo/sam/__init__.py
def set_mate_info_on_secondary(secondary: AlignedSegment, mate_primary: AlignedSegment) -> None:
    """Set mate info on a secondary alignment from its mate's primary alignment.

    Args:
        secondary: The secondary alignment to set mate information upon.
        mate_primary: The primary alignment of the secondary's mate.

    Raises:
        ValueError: If secondary and mate_primary are of the same read ordinal.
        ValueError: If secondary and mate_primary do not share the same query name.
        ValueError: If mate_primary is secondary or supplementary.
        ValueError: If secondary is not marked as a secondary alignment.
    """
    if not secondary.is_secondary:
        raise ValueError("Cannot set mate info on an alignment not marked as secondary!")

    _set_common_mate_fields(dest=secondary, mate_primary=mate_primary)

set_mate_info_on_supplementary

set_mate_info_on_supplementary(supp: AlignedSegment, mate_primary: AlignedSegment) -> None

Set mate info on a supplementary alignment from its mate's primary alignment.

Parameters:

Name Type Description Default
supp AlignedSegment

The supplementary alignment to set mate information upon.

required
mate_primary AlignedSegment

The primary alignment of the supplementary's mate.

required

Raises:

Type Description
ValueError

If supp and mate_primary are of the same read ordinal.

ValueError

If supp and mate_primary do not share the same query name.

ValueError

If mate_primary is secondary or supplementary.

ValueError

If supp is not marked as a supplementary alignment.

Source code in fgpyo/sam/__init__.py
def set_mate_info_on_supplementary(supp: AlignedSegment, mate_primary: AlignedSegment) -> None:
    """Set mate info on a supplementary alignment from its mate's primary alignment.

    Args:
        supp: The supplementary alignment to set mate information upon.
        mate_primary: The primary alignment of the supplementary's mate.

    Raises:
        ValueError: If supp and mate_primary are of the same read ordinal.
        ValueError: If supp and mate_primary do not share the same query name.
        ValueError: If mate_primary is secondary or supplementary.
        ValueError: If supp is not marked as a supplementary alignment.
    """
    if not supp.is_supplementary:
        raise ValueError("Cannot set mate info on an alignment not marked as supplementary!")

    _set_common_mate_fields(dest=supp, mate_primary=mate_primary)

    # NB: for a non-secondary supplemental alignment, set the following to the same as the primary.
    if not supp.is_secondary:
        supp.is_proper_pair = mate_primary.is_proper_pair
        supp.template_length = -mate_primary.template_length

set_pair_info

set_pair_info(r1: AlignedSegment, r2: AlignedSegment, proper_pair: bool = True) -> None

Resets mate pair information between reads in a pair.

Can be handed reads that already have pairing flags setup or independent R1 and R2 records that are currently flagged as SE reads.

Parameters:

Name Type Description Default
r1 AlignedSegment

Read 1 (first read in the template).

required
r2 AlignedSegment

Read 2 with the same query name as r1 (second read in the template).

required
proper_pair bool

whether the pair is proper or not.

True
Source code in fgpyo/sam/__init__.py
@deprecated("Use `set_mate_info()` instead. Deprecated after fgpyo 0.8.0.")
def set_pair_info(r1: AlignedSegment, r2: AlignedSegment, proper_pair: bool = True) -> None:
    """Resets mate pair information between reads in a pair.

    Can be handed reads that already have pairing flags setup or independent R1 and R2 records that
    are currently flagged as SE reads.

    Args:
        r1: Read 1 (first read in the template).
        r2: Read 2 with the same query name as r1 (second read in the template).
        proper_pair: whether the pair is proper or not.
    """
    if r1.query_name != r2.query_name:
        raise ValueError("Cannot set pair info on reads with different query names!")

    for r in [r1, r2]:
        r.is_paired = True

    r1.is_read1 = True
    r1.is_read2 = False
    r2.is_read2 = True
    r2.is_read1 = False

    set_mate_info(rec1=r1, rec2=r2, is_proper_pair=lambda a, b: proper_pair)

sum_of_base_qualities

sum_of_base_qualities(rec: AlignedSegment, min_quality_score: int = 15) -> int

Calculate the sum of base qualities score for an alignment record.

This function is useful for calculating the "mate score" as implemented in samtools fixmate. Consistently with samtools fixmate, this function returns 0 if the record has no base qualities.

Parameters:

Name Type Description Default
rec AlignedSegment

The alignment record to calculate the sum of base qualities from.

required
min_quality_score int

The minimum base quality score to use for summation.

15

Returns:

Type Description
int

The sum of base qualities on the input record. 0 if the record has no base qualities.

See

calc_sum_of_base_qualities() MD_MIN_QUALITY

Source code in fgpyo/sam/__init__.py
def sum_of_base_qualities(rec: AlignedSegment, min_quality_score: int = 15) -> int:
    """Calculate the sum of base qualities score for an alignment record.

    This function is useful for calculating the "mate score" as implemented in `samtools fixmate`.
    Consistently with `samtools fixmate`, this function returns 0 if the record has no base
    qualities.

    Args:
        rec: The alignment record to calculate the sum of base qualities from.
        min_quality_score: The minimum base quality score to use for summation.

    Returns:
        The sum of base qualities on the input record. 0 if the record has no base qualities.

    See:
        [`calc_sum_of_base_qualities()`](https://github.com/samtools/samtools/blob/4f3a7397a1f841020074c0048c503a01a52d5fa2/bam_mate.c#L227-L238)
        [`MD_MIN_QUALITY`](https://github.com/samtools/samtools/blob/4f3a7397a1f841020074c0048c503a01a52d5fa2/bam_mate.c#L42)
    """
    if rec.query_qualities is None or rec.query_qualities == NO_QUERY_QUALITIES:
        return 0

    score: int = sum(qual for qual in rec.query_qualities if qual >= min_quality_score)
    return score

writer

writer(path: SamPath, header: Union[str, Dict[str, Any], AlignmentHeader], file_type: Optional[SamFileType] = None) -> AlignmentFile

Opens a SAM/BAM/CRAM for writing.

To write to standard output, provide any of "-", "stdout", or "/dev/stdout" as the output path. Note: When writing to stdout, the file_type must be given.

Parameters:

Name Type Description Default
path SamPath

a file handle or path to the SAM/BAM/CRAM to read or write.

required
header Union[str, Dict[str, Any], AlignmentHeader]

Either a string to use for the header or a multi-level dictionary. The multi-level dictionary should be given as follows. The first level are the four types (‘HD’, ‘SQ’, ...). The second level are a list of lines, with each line being a list of tag-value pairs. The header is constructed first from all the defined fields, followed by user tags in alphabetical order.

required
file_type Optional[SamFileType]

the file type to assume when opening the file. If None, then the filetype will be auto-detected and must be a path-like object. This argument is required when writing to standard output.

None
Source code in fgpyo/sam/__init__.py
def writer(
    path: SamPath,
    header: Union[str, Dict[str, Any], SamHeader],
    file_type: Optional[SamFileType] = None,
) -> SamFile:
    """Opens a SAM/BAM/CRAM for writing.

    To write to standard output, provide any of `"-"`, `"stdout"`, or `"/dev/stdout"` as the output
    `path`. **Note**: When writing to `stdout`, the `file_type` _must_ be given.

    Args:
        path: a file handle or path to the SAM/BAM/CRAM to read or write.
        header: Either a string to use for the header or a multi-level dictionary.  The
            multi-level dictionary should be given as follows.  The first level are the four
            types (‘HD’, ‘SQ’, ...). The second level are a list of lines, with each line being
            a list of tag-value pairs. The header is constructed first from all the defined
            fields, followed by user tags in alphabetical order.
        file_type: the file type to assume when opening the file.  If `None`, then the
            filetype will be auto-detected and must be a path-like object. This argument is required
            when writing to standard output.
    """
    # Set the header for pysam's AlignmentFile
    key = "text" if isinstance(header, str) else "header"
    kwargs = {key: header}

    return _pysam_open(
        path=path, open_for_reading=False, file_type=file_type, unmapped=False, **kwargs
    )

Modules

builder

Classes for generating SAM and BAM files and records for testing

This module contains utility classes for the generation of SAM and BAM files and alignment records, for use in testing.

Classes

SamBuilder

Builder for constructing one or more sam records (AlignmentSegments in pysam terms).

Provides the ability to manufacture records from minimal arguments, while generating any remaining attributes to ensure a valid record.

A builder is constructed with a handful of defaults including lengths for generated R1s and R2s, the default base quality score to use, a sequence dictionary and a single read group.

Records are then added using the add_pair() method. Once accumulated the records can be accessed in the order in which they were created through the to_unsorted_list() function, or in a list sorted by coordinate order via to_sorted_list(). The latter creates a temporary file to do the sorting and is somewhat slower as a result. Lastly, the records can be written to a temporary file using to_path().

Source code in fgpyo/sam/builder.py
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
class SamBuilder:
    """Builder for constructing one or more sam records (AlignmentSegments in pysam terms).

    Provides the ability to manufacture records from minimal arguments, while generating
    any remaining attributes to ensure a valid record.

    A builder is constructed with a handful of defaults including lengths for generated R1s
    and R2s, the default base quality score to use, a sequence dictionary and a single read group.

    Records are then added using the [`add_pair()`][fgpyo.sam.builder.SamBuilder.add_pair]
    method.  Once accumulated the records can be accessed in the order in which they were created
    through the [`to_unsorted_list()`][fgpyo.sam.builder.SamBuilder.to_unsorted_list]
    function, or in a list sorted by coordinate order via
    [`to_sorted_list()`][fgpyo.sam.builder.SamBuilder.to_sorted_list].  The latter creates
    a temporary file to do the sorting and is somewhat slower as a result.  Lastly, the records can
    be written to a temporary file using
    [`to_path()`][fgpyo.sam.builder.SamBuilder.to_path].
    """

    # The default read one length
    DEFAULT_R1_LENGTH: int = 100

    # The default read two length
    DEFAULT_R2_LENGTH: int = 100

    @staticmethod
    def default_sd() -> List[Dict[str, Any]]:
        """Generates the sequence dictionary that is used by default by SamBuilder.

        Matches the names and lengths of the HG19 reference in use in production.

        Returns:
            A new copy of the sequence dictionary as a list of dictionaries, one per chromosome.
        """
        return [
            {"SN": "chr1", "LN": 249250621},
            {"SN": "chr2", "LN": 243199373},
            {"SN": "chr3", "LN": 198022430},
            {"SN": "chr4", "LN": 191154276},
            {"SN": "chr5", "LN": 180915260},
            {"SN": "chr6", "LN": 171115067},
            {"SN": "chr7", "LN": 159138663},
            {"SN": "chr8", "LN": 146364022},
            {"SN": "chr9", "LN": 141213431},
            {"SN": "chr10", "LN": 135534747},
            {"SN": "chr11", "LN": 135006516},
            {"SN": "chr12", "LN": 133851895},
            {"SN": "chr13", "LN": 115169878},
            {"SN": "chr14", "LN": 107349540},
            {"SN": "chr15", "LN": 102531392},
            {"SN": "chr16", "LN": 90354753},
            {"SN": "chr17", "LN": 81195210},
            {"SN": "chr18", "LN": 78077248},
            {"SN": "chr19", "LN": 59128983},
            {"SN": "chr20", "LN": 63025520},
            {"SN": "chr21", "LN": 48129895},
            {"SN": "chr22", "LN": 51304566},
            {"SN": "chrX", "LN": 155270560},
            {"SN": "chrY", "LN": 59373566},
            {"SN": "chrM", "LN": 16571},
        ]

    @staticmethod
    def default_rg() -> Dict[str, str]:
        """Returns the default read group used by the SamBuilder, as a dictionary."""
        return {"ID": "1", "SM": "1_AAAAAA", "LB": "default", "PL": "ILLUMINA", "PU": "xxx.1"}

    def __init__(
        self,
        r1_len: Optional[int] = None,
        r2_len: Optional[int] = None,
        base_quality: int = 30,
        mapping_quality: int = 60,
        sd: Optional[List[Dict[str, Any]]] = None,
        rg: Optional[Dict[str, str]] = None,
        extra_header: Optional[Dict[str, Any]] = None,
        seed: int = 42,
        sort_order: SamOrder = SamOrder.Coordinate,
    ) -> None:
        """Initializes a new SamBuilder for generating alignment records and SAM/BAM files.

        Args:
            r1_len: The length of R1s to create unless otherwise specified
            r2_len: The length of R2s to create unless otherwise specified
            base_quality: The base quality of bases to create unless otherwise specified
            sd: a sequence dictionary as a list of dicts; defaults to calling default_sd() if None
            rg: a single read group as a dict; defaults to calling default_sd() if None
            extra_header: a dictionary of extra values to add to the header, None otherwise.  See
                          `pysam.AlignmentHeader` for more details.
            seed: a seed value for random number/string generation
            sort_order: Order to sort records when writing to file, or output of to_sorted_list()
        """

        self.r1_len: int = r1_len if r1_len is not None else self.DEFAULT_R1_LENGTH
        self.r2_len: int = r2_len if r2_len is not None else self.DEFAULT_R2_LENGTH
        self.base_quality: int = base_quality
        self.mapping_quality: int = mapping_quality

        if not isinstance(sort_order, SamOrder):
            raise ValueError(f"sort_order must be a SamOrder, got {type(sort_order)}")
        self._sort_order = sort_order

        self._header: Dict[str, Any] = {
            "HD": {"VN": "1.5", "SO": sort_order.value},
            "SQ": (sd if sd is not None else SamBuilder.default_sd()),
            "RG": [(rg if rg is not None else SamBuilder.default_rg())],
        }
        if extra_header is not None:
            self._header = {**self._header, **extra_header}
        self._samheader = AlignmentHeader.from_dict(self._header)
        self._seq_lookup = dict([(s["SN"], s) for s in self._header["SQ"]])

        self._random: Random = Random(seed)
        self._records: List[AlignedSegment] = []
        self._counter: int = 0

    def _next_name(self) -> str:
        """Returns the next available query/template name."""
        n = self._counter
        self._counter += 1
        return f"q{n:>04}"

    def _bases(self, length: int) -> str:
        """Returns a random string of bases of the length requested."""
        return "".join(self._random.choices("ACGT", k=length))

    def _new_rec(
        self,
        name: str,
        chrom: str,
        start: int,
        mapq: Optional[int],
        attrs: Optional[Dict[str, Any]],
    ) -> AlignedSegment:
        """Generates a new AlignedSegment.  Sets the segment up with the correct
        header and adds the RG attribute if not contained in attrs.

        Args:
            name: the name of the read/template
            chrom: the chromosome to which the read is mapped
            start: the start position of the read on the chromosome
            mapq: an optional mapping quality; use self.mapping_quality if None
            attrs: an optional dictionary of SAM attributes with two-char keys

        Returns:
            AlignedSegment: an aligned segment with name, chrom, pos, attributes the
                read group, and the unmapped flag all set appropriately.
        """
        if chrom is not sam.NO_REF_NAME and chrom not in self._seq_lookup:
            raise ValueError(f"{chrom} is not a valid chromosome name in this builder.")

        rec = AlignedSegment(header=self._samheader)
        rec.query_name = name
        rec.reference_name = chrom
        rec.reference_start = start
        rec.mapping_quality = mapq if mapq is not None else self.mapping_quality

        if chrom == sam.NO_REF_NAME or start == sam.NO_REF_POS:
            rec.is_unmapped = True
            rec.mapping_quality = 0

        attrs = attrs if attrs else dict()
        if "RG" not in attrs:
            attrs["RG"] = self.rg_id()
        rec.set_tags(list(attrs.items()))
        return rec

    def _set_flags(
        self,
        rec: pysam.AlignedSegment,
        read_num: Optional[int],
        strand: str,
        secondary: bool = False,
        supplementary: bool = False,
    ) -> None:
        """Appropriately sets most flag fields on the given read.

        Args:
            rec: the read to set the flags on
            read_num: Either None for an unpaired read, or 1 or 2
            strand: Either "+" or "-" to indicate strand of the read
        """
        rec.is_paired = read_num is not None
        rec.is_read1 = read_num == 1
        rec.is_read2 = read_num == 2
        rec.is_qcfail = False
        rec.is_duplicate = False
        rec.is_secondary = secondary
        rec.is_supplementary = supplementary
        if not rec.is_unmapped:
            rec.is_reverse = strand != "+"

    def _set_length_dependent_fields(
        self,
        rec: pysam.AlignedSegment,
        length: int,
        bases: Optional[str] = None,
        quals: Optional[List[int]] = None,
        cigar: Optional[str] = None,
    ) -> None:
        """Fills in bases, quals and cigar on a record.

        If any of bases, quals or cigar are defined, they must all have the same length/query
        length.  If none are defined then the length parameter is used.  Undefined values are
        synthesize at the inferred length.

        Args:
            rec: a SAM record
            length: the length to use if all of bases/quals/cigar are None
            bases: an optional string of bases for the read
            quals: an optional list of qualities for the read
            cigar: an optional cigar string for the read
        """

        # Do some validation to make sure all defined things have the same lengths
        lengths = set()
        if bases is not None:
            lengths.add(len(bases))
        if quals is not None:
            lengths.add(len(quals))
        if cigar is not None:
            cig = sam.Cigar.from_cigarstring(cigar)
            lengths.add(sum([elem.length_on_query for elem in cig.elements]))

        if not lengths:
            lengths.add(length)

        if len(lengths) != 1:
            raise ValueError("Provided bases/quals/cigar are not length compatible.")

        # Fill in the record, making any parts that were not defined as params
        length = lengths.pop()
        query_quals = array("B", quals if quals else [self.base_quality] * length)
        rec.query_sequence = bases if bases else self._bases(length)
        rec.query_qualities = query_quals
        if not rec.is_unmapped:
            rec.cigarstring = cigar if cigar else f"{length}M"

    def rg(self) -> Dict[str, Any]:
        """Returns the single read group that is defined in the header."""
        # The `RG` field contains a list of read group mappings
        # e.g. `[{"ID": "rg1", "PL": "ILLUMINA"}]`
        rgs = cast(List[Dict[str, Any]], self._header["RG"])
        assert len(rgs) == 1, "Header did not contain exactly one read group!"
        return rgs[0]

    def rg_id(self) -> str:
        """Returns the ID of the single read group that is defined in the header."""
        # The read group mapping has mixed types of values (e.g. "PI" is numeric), but the "ID"
        # field is always a string.
        return cast(str, self.rg()["ID"])

    def add_pair(
        self,
        *,
        name: Optional[str] = None,
        bases1: Optional[str] = None,
        bases2: Optional[str] = None,
        quals1: Optional[List[int]] = None,
        quals2: Optional[List[int]] = None,
        chrom: Optional[str] = None,
        chrom1: Optional[str] = None,
        chrom2: Optional[str] = None,
        start1: int = sam.NO_REF_POS,
        start2: int = sam.NO_REF_POS,
        cigar1: Optional[str] = None,
        cigar2: Optional[str] = None,
        mapq1: Optional[int] = None,
        mapq2: Optional[int] = None,
        strand1: str = "+",
        strand2: str = "-",
        attrs: Optional[Dict[str, Any]] = None,
    ) -> Tuple[AlignedSegment, AlignedSegment]:
        """Generates a new pair of reads, adds them to the internal collection, and returns them.

        Most fields are optional.

        Mapped pairs can be created by specifying both `start1` and `start2` and either `chrom`, for
        pairs where both reads map to the same contig, or both `chrom1` and `chrom2`, for pairs
        where reads map to different contigs. i.e.:

            - `add_pair(chrom, start1, start2)` will create a mapped pair where both reads map to
              the same contig (`chrom`).
            - `add_pair(chrom1, start1, chrom2, start2)` will create a mapped pair where the reads
              map to different contigs (`chrom1` and `chrom2`).

        A pair with only one of the two reads mapped can be created by setting only one start
        position. Flags will automatically be set correctly for the unmapped mate.

            - `add_pair(chrom, start1)`
            - `add_pair(chrom1, start1)`
            - `add_pair(chrom, start2)`
            - `add_pair(chrom2, start2)`

        An unmapped pair can be created by calling the method with no parameters (specifically,
        not setting `chrom`, `chrom1`, `start1`, `chrom2`, or `start2`). If either cigar is
        provided, it will be ignored.

        For a given read (i.e. R1 or R2) the length of the read is determined based on the presence
        or absence of bases, quals, and cigar.  If values are provided for one or more of these
        parameters, the lengths must match, and the length will be used to generate any
        unsupplied values.  If none of bases, quals, and cigar are provided, all three will be
        synthesized based on either the r1_len or r2_len stored on the class as appropriate.

        When synthesizing, bases are always a random sequence of bases, quals are all the default
        base quality (supplied when constructing a SamBuilder) and the cigar is always a single M
        operator of the read length.

        Args:
            name: The name of the template. If None is given a unique name will be auto-generated.
            bases1: The bases for R1. If None is given a random sequence is generated.
            bases2: The bases for R2. If None is given a random sequence is generated.
            quals1: The list of int qualities for R1. If None, the default base quality is used.
            quals2: The list of int qualities for R2. If None, the default base quality is used.
            chrom: The chromosome to which both reads are mapped. Defaults to the unmapped value.
            chrom1: The chromosome to which R1 is mapped. If None, `chrom` is used.
            chrom2: The chromosome to which R2 is mapped. If None, `chrom` is used.
            start1: The start position of R1. Defaults to the unmapped value.
            start2: The start position of R2. Defaults to the unmapped value.
            cigar1: The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.
            cigar2: The cigar string for R2. Defaults to None for unmapped reads, otherwise all M.
            mapq1: Mapping quality for R1. Defaults to self.mapping_quality if None.
            mapq2: Mapping quality for R2. Defaults to self.mapping_quality if None.
            strand1: The strand for R1, either "+" or "-". Defaults to "+".
            strand2: The strand for R2, either "+" or "-". Defaults to "-".
            attrs: An optional dictionary of SAM attribute to place on both R1 and R2.

        Raises:
            ValueError: if either strand field is not "+" or "-"
            ValueError: if bases/quals/cigar are set in a way that is not self-consistent

        Returns:
            Tuple[AlignedSegment, AlignedSegment]: The pair of records created, R1 then R2.
        """

        if strand1 not in ["+", "-"]:
            raise ValueError(f"Invalid value for strand1: {strand1}")
        if strand2 not in ["+", "-"]:
            raise ValueError(f"Invalid value for strand2: {strand2}")

        name = name if name is not None else self._next_name()

        # Valid parameterizations for contig mapping (backward compatible):
        # - chrom, start1, start2
        # - chrom, start1
        # - chrom, start2
        # Valid parameterizations for contig mapping (new):
        # - chrom1, start1, chrom2, start2
        # - chrom1, start1
        # - chrom2, start2
        if chrom is not None and (chrom1 is not None or chrom2 is not None):
            raise ValueError("Cannot use chrom in combination with chrom1 or chrom2")

        chrom = sam.NO_REF_NAME if chrom is None else chrom

        if start1 != sam.NO_REF_POS:
            chrom1 = next(c for c in (chrom1, chrom) if c is not None)
        else:
            chrom1 = sam.NO_REF_NAME

        if start2 != sam.NO_REF_POS:
            chrom2 = next(c for c in (chrom2, chrom) if c is not None)
        else:
            chrom2 = sam.NO_REF_NAME

        if chrom1 == sam.NO_REF_NAME and start1 != sam.NO_REF_POS:
            raise ValueError("start1 cannot be used on its own - specify chrom or chrom1")

        if chrom2 == sam.NO_REF_NAME and start2 != sam.NO_REF_POS:
            raise ValueError("start2 cannot be used on its own - specify chrom or chrom2")

        # Setup R1
        r1 = self._new_rec(name=name, chrom=chrom1, start=start1, mapq=mapq1, attrs=attrs)
        self._set_flags(r1, read_num=1, strand=strand1)
        self._set_length_dependent_fields(
            rec=r1, length=self.r1_len, bases=bases1, quals=quals1, cigar=cigar1
        )

        # Setup R2
        r2 = self._new_rec(name=name, chrom=chrom2, start=start2, mapq=mapq2, attrs=attrs)
        self._set_flags(r2, read_num=2, strand=strand2)
        self._set_length_dependent_fields(
            rec=r2, length=self.r2_len, bases=bases2, quals=quals2, cigar=cigar2
        )

        # Sync up mate info and we're done!
        sam.set_mate_info(r1, r2)
        self._records.append(r1)
        self._records.append(r2)
        return r1, r2

    def add_single(
        self,
        *,
        name: Optional[str] = None,
        read_num: Optional[int] = None,
        bases: Optional[str] = None,
        quals: Optional[List[int]] = None,
        chrom: str = sam.NO_REF_NAME,
        start: int = sam.NO_REF_POS,
        cigar: Optional[str] = None,
        mapq: Optional[int] = None,
        strand: str = "+",
        secondary: bool = False,
        supplementary: bool = False,
        attrs: Optional[Dict[str, Any]] = None,
    ) -> AlignedSegment:
        """Generates a new single reads, adds them to the internal collection, and returns it.

        Most fields are optional.

        If `read_num` is None (the default) an unpaired read will be created.  If `read_num` is
        set to 1 or 2, the read will have it's paired flag set and read number flags set.

        An unmapped read can be created by calling the method with no parameters (specifically,
        not setting chrom, start1 or start2).  If cigar is provided, it will be ignored.

        A mapped read is created by providing chrom and start.

        The length of the read is determined based on the presence or absence of bases, quals,
        and cigar.  If values are provided for one or more of these parameters, the lengths must
        match, and the length will be used to generate any unsupplied values.  If none of bases,
        quals, and cigar are provided, all three will be synthesized based on either the r1_len
        or r2_len stored on the class as appropriate.

        When synthesizing, bases are always a random sequence of bases, quals are all the default
        base quality (supplied when constructing a SamBuilder) and the cigar is always a single M
        operator of the read length.

        Args:
            name: The name of the template. If None is given a unique name will be auto-generated.
            read_num: Either None, 1 for R1 or 2 for R2
            bases: The bases for the read. If None is given a random sequence is generated.
            quals: The list of qualities for the read. If None, the default base quality is used.
            chrom: The chromosome to which both reads are mapped. Defaults to the unmapped value.
            start: The start position of the read. Defaults to the unmapped value.
            cigar: The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.
            mapq: Mapping quality for the read. Default to self.mapping_quality if not given.
            strand: The strand for R1, either "+" or "-". Defaults to "+".
            secondary: If true the read will be flagged as secondary
            supplementary: If true the read will be flagged as supplementary
            attrs: An optional dictionary of SAM attribute to place on both R1 and R2.

        Raises:
            ValueError: if strand field is not "+" or "-"
            ValueError: if read_num is not None, 1 or 2
            ValueError: if bases/quals/cigar are set in a way that is not self-consistent

        Returns:
            AlignedSegment: The record created
        """

        if strand not in ["+", "-"]:
            raise ValueError(f"Invalid value for strand1: {strand}")
        if read_num not in [None, 1, 2]:
            raise ValueError(f"Invalid value for read_num: {read_num}")

        name = name if name is not None else self._next_name()

        # Setup the read
        read_len = self.r1_len if read_num != 2 else self.r2_len
        rec = self._new_rec(name=name, chrom=chrom, start=start, mapq=mapq, attrs=attrs)
        self._set_flags(
            rec, read_num=read_num, strand=strand, secondary=secondary, supplementary=supplementary
        )
        self._set_length_dependent_fields(
            rec=rec, length=read_len, bases=bases, quals=quals, cigar=cigar
        )

        self._records.append(rec)
        return rec

    def to_path(  # noqa: C901
        self,
        path: Optional[Path] = None,
        index: bool = True,
        pred: Callable[[AlignedSegment], bool] = lambda r: True,
        tmp_file_type: Optional[sam.SamFileType] = None,
    ) -> Path:
        """Write the accumulated records to a file, sorts & indexes it, and returns the Path.
        If a path is provided, it will be written to, otherwise a temporary file is created
        and returned.

        If `path` is provided, `tmp_file_type` may not be provided. In this case, the file type
        (SAM/BAM/CRAM) will be automatically determined by the file extension when a path
        is provided.  See `~pysam` for more details.

        If `path` is not provided, the file type will default to BAM unless `tmp_file_type` is
        provided.

        Args:
            path: a path at which to write the file, otherwise a temp file is used.
            index: if True and `sort_order` is `Coordinate` and output is a BAM/CRAM file, then
                   an index is generated, otherwise not.
            pred: optional predicate to specify which reads should be output
            tmp_file_type: the file type to output when a path is not provided (default is BAM)

        Returns:
            Path: The path to the sorted (and possibly indexed) file.
        """
        if path is not None:
            # Get the file type if a path was given (in this case, a file type may not be
            # provided too)
            if tmp_file_type is not None:
                raise ValueError("Both `path` and `tmp_file_type` cannot be provided.")
            tmp_file_type = sam.SamFileType.from_path(path)
        elif tmp_file_type is None:
            # Use the provided file type
            tmp_file_type = sam.SamFileType.BAM

        # Get the extension, and create a path if none was given
        ext = tmp_file_type.extension
        if path is None:
            with NamedTemporaryFile(suffix=ext, delete=False) as fp:
                path = Path(fp.name)

        with NamedTemporaryFile(suffix=ext, delete=True) as fp:
            file_handle: IO
            if self._sort_order in {SamOrder.Unsorted, SamOrder.Unknown}:
                file_handle = path.open("w")
            else:
                file_handle = fp.file

            with sam.writer(file_handle, header=self._samheader, file_type=tmp_file_type) as writer:
                for rec in self._records:
                    if pred(rec):
                        writer.write(rec)

            samtools_sort_args = ["-o", str(path), fp.name]

            file_handle.close()
            if self._sort_order == SamOrder.QueryName:
                pysam.sort("-n", *samtools_sort_args)
            elif self._sort_order == SamOrder.Coordinate:
                if index and tmp_file_type.indexable:
                    samtools_sort_args.insert(0, "--write-index")
                pysam.sort(*samtools_sort_args)

        return path

    def __len__(self) -> int:
        """Returns the number of records accumulated so far."""
        return len(self._records)

    def to_unsorted_list(self) -> List[pysam.AlignedSegment]:
        """Returns the accumulated records in the order they were created."""
        return list(self._records)

    def to_sorted_list(self) -> List[pysam.AlignedSegment]:
        """Returns the accumulated records in coordinate order."""
        with NamedTemporaryFile(suffix=".bam", delete=True) as fp:
            filename = fp.name
            path = self.to_path(path=Path(filename), index=False)
            bam = sam.reader(path)
            return list(bam)

    @property
    def header(self) -> AlignmentHeader:
        """Returns the builder's SAM header."""
        return self._samheader
Attributes
header property
header: AlignmentHeader

Returns the builder's SAM header.

Functions
__init__
__init__(r1_len: Optional[int] = None, r2_len: Optional[int] = None, base_quality: int = 30, mapping_quality: int = 60, sd: Optional[List[Dict[str, Any]]] = None, rg: Optional[Dict[str, str]] = None, extra_header: Optional[Dict[str, Any]] = None, seed: int = 42, sort_order: SamOrder = Coordinate) -> None

Initializes a new SamBuilder for generating alignment records and SAM/BAM files.

Parameters:

Name Type Description Default
r1_len Optional[int]

The length of R1s to create unless otherwise specified

None
r2_len Optional[int]

The length of R2s to create unless otherwise specified

None
base_quality int

The base quality of bases to create unless otherwise specified

30
sd Optional[List[Dict[str, Any]]]

a sequence dictionary as a list of dicts; defaults to calling default_sd() if None

None
rg Optional[Dict[str, str]]

a single read group as a dict; defaults to calling default_sd() if None

None
extra_header Optional[Dict[str, Any]]

a dictionary of extra values to add to the header, None otherwise. See pysam.AlignmentHeader for more details.

None
seed int

a seed value for random number/string generation

42
sort_order SamOrder

Order to sort records when writing to file, or output of to_sorted_list()

Coordinate
Source code in fgpyo/sam/builder.py
def __init__(
    self,
    r1_len: Optional[int] = None,
    r2_len: Optional[int] = None,
    base_quality: int = 30,
    mapping_quality: int = 60,
    sd: Optional[List[Dict[str, Any]]] = None,
    rg: Optional[Dict[str, str]] = None,
    extra_header: Optional[Dict[str, Any]] = None,
    seed: int = 42,
    sort_order: SamOrder = SamOrder.Coordinate,
) -> None:
    """Initializes a new SamBuilder for generating alignment records and SAM/BAM files.

    Args:
        r1_len: The length of R1s to create unless otherwise specified
        r2_len: The length of R2s to create unless otherwise specified
        base_quality: The base quality of bases to create unless otherwise specified
        sd: a sequence dictionary as a list of dicts; defaults to calling default_sd() if None
        rg: a single read group as a dict; defaults to calling default_sd() if None
        extra_header: a dictionary of extra values to add to the header, None otherwise.  See
                      `pysam.AlignmentHeader` for more details.
        seed: a seed value for random number/string generation
        sort_order: Order to sort records when writing to file, or output of to_sorted_list()
    """

    self.r1_len: int = r1_len if r1_len is not None else self.DEFAULT_R1_LENGTH
    self.r2_len: int = r2_len if r2_len is not None else self.DEFAULT_R2_LENGTH
    self.base_quality: int = base_quality
    self.mapping_quality: int = mapping_quality

    if not isinstance(sort_order, SamOrder):
        raise ValueError(f"sort_order must be a SamOrder, got {type(sort_order)}")
    self._sort_order = sort_order

    self._header: Dict[str, Any] = {
        "HD": {"VN": "1.5", "SO": sort_order.value},
        "SQ": (sd if sd is not None else SamBuilder.default_sd()),
        "RG": [(rg if rg is not None else SamBuilder.default_rg())],
    }
    if extra_header is not None:
        self._header = {**self._header, **extra_header}
    self._samheader = AlignmentHeader.from_dict(self._header)
    self._seq_lookup = dict([(s["SN"], s) for s in self._header["SQ"]])

    self._random: Random = Random(seed)
    self._records: List[AlignedSegment] = []
    self._counter: int = 0
__len__
__len__() -> int

Returns the number of records accumulated so far.

Source code in fgpyo/sam/builder.py
def __len__(self) -> int:
    """Returns the number of records accumulated so far."""
    return len(self._records)
add_pair
add_pair(*, name: Optional[str] = None, bases1: Optional[str] = None, bases2: Optional[str] = None, quals1: Optional[List[int]] = None, quals2: Optional[List[int]] = None, chrom: Optional[str] = None, chrom1: Optional[str] = None, chrom2: Optional[str] = None, start1: int = NO_REF_POS, start2: int = NO_REF_POS, cigar1: Optional[str] = None, cigar2: Optional[str] = None, mapq1: Optional[int] = None, mapq2: Optional[int] = None, strand1: str = '+', strand2: str = '-', attrs: Optional[Dict[str, Any]] = None) -> Tuple[AlignedSegment, AlignedSegment]

Generates a new pair of reads, adds them to the internal collection, and returns them.

Most fields are optional.

Mapped pairs can be created by specifying both start1 and start2 and either chrom, for pairs where both reads map to the same contig, or both chrom1 and chrom2, for pairs where reads map to different contigs. i.e.:

- `add_pair(chrom, start1, start2)` will create a mapped pair where both reads map to
  the same contig (`chrom`).
- `add_pair(chrom1, start1, chrom2, start2)` will create a mapped pair where the reads
  map to different contigs (`chrom1` and `chrom2`).

A pair with only one of the two reads mapped can be created by setting only one start position. Flags will automatically be set correctly for the unmapped mate.

- `add_pair(chrom, start1)`
- `add_pair(chrom1, start1)`
- `add_pair(chrom, start2)`
- `add_pair(chrom2, start2)`

An unmapped pair can be created by calling the method with no parameters (specifically, not setting chrom, chrom1, start1, chrom2, or start2). If either cigar is provided, it will be ignored.

For a given read (i.e. R1 or R2) the length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.

When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.

Parameters:

Name Type Description Default
name Optional[str]

The name of the template. If None is given a unique name will be auto-generated.

None
bases1 Optional[str]

The bases for R1. If None is given a random sequence is generated.

None
bases2 Optional[str]

The bases for R2. If None is given a random sequence is generated.

None
quals1 Optional[List[int]]

The list of int qualities for R1. If None, the default base quality is used.

None
quals2 Optional[List[int]]

The list of int qualities for R2. If None, the default base quality is used.

None
chrom Optional[str]

The chromosome to which both reads are mapped. Defaults to the unmapped value.

None
chrom1 Optional[str]

The chromosome to which R1 is mapped. If None, chrom is used.

None
chrom2 Optional[str]

The chromosome to which R2 is mapped. If None, chrom is used.

None
start1 int

The start position of R1. Defaults to the unmapped value.

NO_REF_POS
start2 int

The start position of R2. Defaults to the unmapped value.

NO_REF_POS
cigar1 Optional[str]

The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.

None
cigar2 Optional[str]

The cigar string for R2. Defaults to None for unmapped reads, otherwise all M.

None
mapq1 Optional[int]

Mapping quality for R1. Defaults to self.mapping_quality if None.

None
mapq2 Optional[int]

Mapping quality for R2. Defaults to self.mapping_quality if None.

None
strand1 str

The strand for R1, either "+" or "-". Defaults to "+".

'+'
strand2 str

The strand for R2, either "+" or "-". Defaults to "-".

'-'
attrs Optional[Dict[str, Any]]

An optional dictionary of SAM attribute to place on both R1 and R2.

None

Raises:

Type Description
ValueError

if either strand field is not "+" or "-"

ValueError

if bases/quals/cigar are set in a way that is not self-consistent

Returns:

Type Description
Tuple[AlignedSegment, AlignedSegment]

Tuple[AlignedSegment, AlignedSegment]: The pair of records created, R1 then R2.

Source code in fgpyo/sam/builder.py
def add_pair(
    self,
    *,
    name: Optional[str] = None,
    bases1: Optional[str] = None,
    bases2: Optional[str] = None,
    quals1: Optional[List[int]] = None,
    quals2: Optional[List[int]] = None,
    chrom: Optional[str] = None,
    chrom1: Optional[str] = None,
    chrom2: Optional[str] = None,
    start1: int = sam.NO_REF_POS,
    start2: int = sam.NO_REF_POS,
    cigar1: Optional[str] = None,
    cigar2: Optional[str] = None,
    mapq1: Optional[int] = None,
    mapq2: Optional[int] = None,
    strand1: str = "+",
    strand2: str = "-",
    attrs: Optional[Dict[str, Any]] = None,
) -> Tuple[AlignedSegment, AlignedSegment]:
    """Generates a new pair of reads, adds them to the internal collection, and returns them.

    Most fields are optional.

    Mapped pairs can be created by specifying both `start1` and `start2` and either `chrom`, for
    pairs where both reads map to the same contig, or both `chrom1` and `chrom2`, for pairs
    where reads map to different contigs. i.e.:

        - `add_pair(chrom, start1, start2)` will create a mapped pair where both reads map to
          the same contig (`chrom`).
        - `add_pair(chrom1, start1, chrom2, start2)` will create a mapped pair where the reads
          map to different contigs (`chrom1` and `chrom2`).

    A pair with only one of the two reads mapped can be created by setting only one start
    position. Flags will automatically be set correctly for the unmapped mate.

        - `add_pair(chrom, start1)`
        - `add_pair(chrom1, start1)`
        - `add_pair(chrom, start2)`
        - `add_pair(chrom2, start2)`

    An unmapped pair can be created by calling the method with no parameters (specifically,
    not setting `chrom`, `chrom1`, `start1`, `chrom2`, or `start2`). If either cigar is
    provided, it will be ignored.

    For a given read (i.e. R1 or R2) the length of the read is determined based on the presence
    or absence of bases, quals, and cigar.  If values are provided for one or more of these
    parameters, the lengths must match, and the length will be used to generate any
    unsupplied values.  If none of bases, quals, and cigar are provided, all three will be
    synthesized based on either the r1_len or r2_len stored on the class as appropriate.

    When synthesizing, bases are always a random sequence of bases, quals are all the default
    base quality (supplied when constructing a SamBuilder) and the cigar is always a single M
    operator of the read length.

    Args:
        name: The name of the template. If None is given a unique name will be auto-generated.
        bases1: The bases for R1. If None is given a random sequence is generated.
        bases2: The bases for R2. If None is given a random sequence is generated.
        quals1: The list of int qualities for R1. If None, the default base quality is used.
        quals2: The list of int qualities for R2. If None, the default base quality is used.
        chrom: The chromosome to which both reads are mapped. Defaults to the unmapped value.
        chrom1: The chromosome to which R1 is mapped. If None, `chrom` is used.
        chrom2: The chromosome to which R2 is mapped. If None, `chrom` is used.
        start1: The start position of R1. Defaults to the unmapped value.
        start2: The start position of R2. Defaults to the unmapped value.
        cigar1: The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.
        cigar2: The cigar string for R2. Defaults to None for unmapped reads, otherwise all M.
        mapq1: Mapping quality for R1. Defaults to self.mapping_quality if None.
        mapq2: Mapping quality for R2. Defaults to self.mapping_quality if None.
        strand1: The strand for R1, either "+" or "-". Defaults to "+".
        strand2: The strand for R2, either "+" or "-". Defaults to "-".
        attrs: An optional dictionary of SAM attribute to place on both R1 and R2.

    Raises:
        ValueError: if either strand field is not "+" or "-"
        ValueError: if bases/quals/cigar are set in a way that is not self-consistent

    Returns:
        Tuple[AlignedSegment, AlignedSegment]: The pair of records created, R1 then R2.
    """

    if strand1 not in ["+", "-"]:
        raise ValueError(f"Invalid value for strand1: {strand1}")
    if strand2 not in ["+", "-"]:
        raise ValueError(f"Invalid value for strand2: {strand2}")

    name = name if name is not None else self._next_name()

    # Valid parameterizations for contig mapping (backward compatible):
    # - chrom, start1, start2
    # - chrom, start1
    # - chrom, start2
    # Valid parameterizations for contig mapping (new):
    # - chrom1, start1, chrom2, start2
    # - chrom1, start1
    # - chrom2, start2
    if chrom is not None and (chrom1 is not None or chrom2 is not None):
        raise ValueError("Cannot use chrom in combination with chrom1 or chrom2")

    chrom = sam.NO_REF_NAME if chrom is None else chrom

    if start1 != sam.NO_REF_POS:
        chrom1 = next(c for c in (chrom1, chrom) if c is not None)
    else:
        chrom1 = sam.NO_REF_NAME

    if start2 != sam.NO_REF_POS:
        chrom2 = next(c for c in (chrom2, chrom) if c is not None)
    else:
        chrom2 = sam.NO_REF_NAME

    if chrom1 == sam.NO_REF_NAME and start1 != sam.NO_REF_POS:
        raise ValueError("start1 cannot be used on its own - specify chrom or chrom1")

    if chrom2 == sam.NO_REF_NAME and start2 != sam.NO_REF_POS:
        raise ValueError("start2 cannot be used on its own - specify chrom or chrom2")

    # Setup R1
    r1 = self._new_rec(name=name, chrom=chrom1, start=start1, mapq=mapq1, attrs=attrs)
    self._set_flags(r1, read_num=1, strand=strand1)
    self._set_length_dependent_fields(
        rec=r1, length=self.r1_len, bases=bases1, quals=quals1, cigar=cigar1
    )

    # Setup R2
    r2 = self._new_rec(name=name, chrom=chrom2, start=start2, mapq=mapq2, attrs=attrs)
    self._set_flags(r2, read_num=2, strand=strand2)
    self._set_length_dependent_fields(
        rec=r2, length=self.r2_len, bases=bases2, quals=quals2, cigar=cigar2
    )

    # Sync up mate info and we're done!
    sam.set_mate_info(r1, r2)
    self._records.append(r1)
    self._records.append(r2)
    return r1, r2
add_single
add_single(*, name: Optional[str] = None, read_num: Optional[int] = None, bases: Optional[str] = None, quals: Optional[List[int]] = None, chrom: str = NO_REF_NAME, start: int = NO_REF_POS, cigar: Optional[str] = None, mapq: Optional[int] = None, strand: str = '+', secondary: bool = False, supplementary: bool = False, attrs: Optional[Dict[str, Any]] = None) -> AlignedSegment

Generates a new single reads, adds them to the internal collection, and returns it.

Most fields are optional.

If read_num is None (the default) an unpaired read will be created. If read_num is set to 1 or 2, the read will have it's paired flag set and read number flags set.

An unmapped read can be created by calling the method with no parameters (specifically, not setting chrom, start1 or start2). If cigar is provided, it will be ignored.

A mapped read is created by providing chrom and start.

The length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.

When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.

Parameters:

Name Type Description Default
name Optional[str]

The name of the template. If None is given a unique name will be auto-generated.

None
read_num Optional[int]

Either None, 1 for R1 or 2 for R2

None
bases Optional[str]

The bases for the read. If None is given a random sequence is generated.

None
quals Optional[List[int]]

The list of qualities for the read. If None, the default base quality is used.

None
chrom str

The chromosome to which both reads are mapped. Defaults to the unmapped value.

NO_REF_NAME
start int

The start position of the read. Defaults to the unmapped value.

NO_REF_POS
cigar Optional[str]

The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.

None
mapq Optional[int]

Mapping quality for the read. Default to self.mapping_quality if not given.

None
strand str

The strand for R1, either "+" or "-". Defaults to "+".

'+'
secondary bool

If true the read will be flagged as secondary

False
supplementary bool

If true the read will be flagged as supplementary

False
attrs Optional[Dict[str, Any]]

An optional dictionary of SAM attribute to place on both R1 and R2.

None

Raises:

Type Description
ValueError

if strand field is not "+" or "-"

ValueError

if read_num is not None, 1 or 2

ValueError

if bases/quals/cigar are set in a way that is not self-consistent

Returns:

Name Type Description
AlignedSegment AlignedSegment

The record created

Source code in fgpyo/sam/builder.py
def add_single(
    self,
    *,
    name: Optional[str] = None,
    read_num: Optional[int] = None,
    bases: Optional[str] = None,
    quals: Optional[List[int]] = None,
    chrom: str = sam.NO_REF_NAME,
    start: int = sam.NO_REF_POS,
    cigar: Optional[str] = None,
    mapq: Optional[int] = None,
    strand: str = "+",
    secondary: bool = False,
    supplementary: bool = False,
    attrs: Optional[Dict[str, Any]] = None,
) -> AlignedSegment:
    """Generates a new single reads, adds them to the internal collection, and returns it.

    Most fields are optional.

    If `read_num` is None (the default) an unpaired read will be created.  If `read_num` is
    set to 1 or 2, the read will have it's paired flag set and read number flags set.

    An unmapped read can be created by calling the method with no parameters (specifically,
    not setting chrom, start1 or start2).  If cigar is provided, it will be ignored.

    A mapped read is created by providing chrom and start.

    The length of the read is determined based on the presence or absence of bases, quals,
    and cigar.  If values are provided for one or more of these parameters, the lengths must
    match, and the length will be used to generate any unsupplied values.  If none of bases,
    quals, and cigar are provided, all three will be synthesized based on either the r1_len
    or r2_len stored on the class as appropriate.

    When synthesizing, bases are always a random sequence of bases, quals are all the default
    base quality (supplied when constructing a SamBuilder) and the cigar is always a single M
    operator of the read length.

    Args:
        name: The name of the template. If None is given a unique name will be auto-generated.
        read_num: Either None, 1 for R1 or 2 for R2
        bases: The bases for the read. If None is given a random sequence is generated.
        quals: The list of qualities for the read. If None, the default base quality is used.
        chrom: The chromosome to which both reads are mapped. Defaults to the unmapped value.
        start: The start position of the read. Defaults to the unmapped value.
        cigar: The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.
        mapq: Mapping quality for the read. Default to self.mapping_quality if not given.
        strand: The strand for R1, either "+" or "-". Defaults to "+".
        secondary: If true the read will be flagged as secondary
        supplementary: If true the read will be flagged as supplementary
        attrs: An optional dictionary of SAM attribute to place on both R1 and R2.

    Raises:
        ValueError: if strand field is not "+" or "-"
        ValueError: if read_num is not None, 1 or 2
        ValueError: if bases/quals/cigar are set in a way that is not self-consistent

    Returns:
        AlignedSegment: The record created
    """

    if strand not in ["+", "-"]:
        raise ValueError(f"Invalid value for strand1: {strand}")
    if read_num not in [None, 1, 2]:
        raise ValueError(f"Invalid value for read_num: {read_num}")

    name = name if name is not None else self._next_name()

    # Setup the read
    read_len = self.r1_len if read_num != 2 else self.r2_len
    rec = self._new_rec(name=name, chrom=chrom, start=start, mapq=mapq, attrs=attrs)
    self._set_flags(
        rec, read_num=read_num, strand=strand, secondary=secondary, supplementary=supplementary
    )
    self._set_length_dependent_fields(
        rec=rec, length=read_len, bases=bases, quals=quals, cigar=cigar
    )

    self._records.append(rec)
    return rec
default_rg staticmethod
default_rg() -> Dict[str, str]

Returns the default read group used by the SamBuilder, as a dictionary.

Source code in fgpyo/sam/builder.py
@staticmethod
def default_rg() -> Dict[str, str]:
    """Returns the default read group used by the SamBuilder, as a dictionary."""
    return {"ID": "1", "SM": "1_AAAAAA", "LB": "default", "PL": "ILLUMINA", "PU": "xxx.1"}
default_sd staticmethod
default_sd() -> List[Dict[str, Any]]

Generates the sequence dictionary that is used by default by SamBuilder.

Matches the names and lengths of the HG19 reference in use in production.

Returns:

Type Description
List[Dict[str, Any]]

A new copy of the sequence dictionary as a list of dictionaries, one per chromosome.

Source code in fgpyo/sam/builder.py
@staticmethod
def default_sd() -> List[Dict[str, Any]]:
    """Generates the sequence dictionary that is used by default by SamBuilder.

    Matches the names and lengths of the HG19 reference in use in production.

    Returns:
        A new copy of the sequence dictionary as a list of dictionaries, one per chromosome.
    """
    return [
        {"SN": "chr1", "LN": 249250621},
        {"SN": "chr2", "LN": 243199373},
        {"SN": "chr3", "LN": 198022430},
        {"SN": "chr4", "LN": 191154276},
        {"SN": "chr5", "LN": 180915260},
        {"SN": "chr6", "LN": 171115067},
        {"SN": "chr7", "LN": 159138663},
        {"SN": "chr8", "LN": 146364022},
        {"SN": "chr9", "LN": 141213431},
        {"SN": "chr10", "LN": 135534747},
        {"SN": "chr11", "LN": 135006516},
        {"SN": "chr12", "LN": 133851895},
        {"SN": "chr13", "LN": 115169878},
        {"SN": "chr14", "LN": 107349540},
        {"SN": "chr15", "LN": 102531392},
        {"SN": "chr16", "LN": 90354753},
        {"SN": "chr17", "LN": 81195210},
        {"SN": "chr18", "LN": 78077248},
        {"SN": "chr19", "LN": 59128983},
        {"SN": "chr20", "LN": 63025520},
        {"SN": "chr21", "LN": 48129895},
        {"SN": "chr22", "LN": 51304566},
        {"SN": "chrX", "LN": 155270560},
        {"SN": "chrY", "LN": 59373566},
        {"SN": "chrM", "LN": 16571},
    ]
rg
rg() -> Dict[str, Any]

Returns the single read group that is defined in the header.

Source code in fgpyo/sam/builder.py
def rg(self) -> Dict[str, Any]:
    """Returns the single read group that is defined in the header."""
    # The `RG` field contains a list of read group mappings
    # e.g. `[{"ID": "rg1", "PL": "ILLUMINA"}]`
    rgs = cast(List[Dict[str, Any]], self._header["RG"])
    assert len(rgs) == 1, "Header did not contain exactly one read group!"
    return rgs[0]
rg_id
rg_id() -> str

Returns the ID of the single read group that is defined in the header.

Source code in fgpyo/sam/builder.py
def rg_id(self) -> str:
    """Returns the ID of the single read group that is defined in the header."""
    # The read group mapping has mixed types of values (e.g. "PI" is numeric), but the "ID"
    # field is always a string.
    return cast(str, self.rg()["ID"])
to_path
to_path(path: Optional[Path] = None, index: bool = True, pred: Callable[[AlignedSegment], bool] = lambda r: True, tmp_file_type: Optional[SamFileType] = None) -> Path

Write the accumulated records to a file, sorts & indexes it, and returns the Path. If a path is provided, it will be written to, otherwise a temporary file is created and returned.

If path is provided, tmp_file_type may not be provided. In this case, the file type (SAM/BAM/CRAM) will be automatically determined by the file extension when a path is provided. See ~pysam for more details.

If path is not provided, the file type will default to BAM unless tmp_file_type is provided.

Parameters:

Name Type Description Default
path Optional[Path]

a path at which to write the file, otherwise a temp file is used.

None
index bool

if True and sort_order is Coordinate and output is a BAM/CRAM file, then an index is generated, otherwise not.

True
pred Callable[[AlignedSegment], bool]

optional predicate to specify which reads should be output

lambda r: True
tmp_file_type Optional[SamFileType]

the file type to output when a path is not provided (default is BAM)

None

Returns:

Name Type Description
Path Path

The path to the sorted (and possibly indexed) file.

Source code in fgpyo/sam/builder.py
def to_path(  # noqa: C901
    self,
    path: Optional[Path] = None,
    index: bool = True,
    pred: Callable[[AlignedSegment], bool] = lambda r: True,
    tmp_file_type: Optional[sam.SamFileType] = None,
) -> Path:
    """Write the accumulated records to a file, sorts & indexes it, and returns the Path.
    If a path is provided, it will be written to, otherwise a temporary file is created
    and returned.

    If `path` is provided, `tmp_file_type` may not be provided. In this case, the file type
    (SAM/BAM/CRAM) will be automatically determined by the file extension when a path
    is provided.  See `~pysam` for more details.

    If `path` is not provided, the file type will default to BAM unless `tmp_file_type` is
    provided.

    Args:
        path: a path at which to write the file, otherwise a temp file is used.
        index: if True and `sort_order` is `Coordinate` and output is a BAM/CRAM file, then
               an index is generated, otherwise not.
        pred: optional predicate to specify which reads should be output
        tmp_file_type: the file type to output when a path is not provided (default is BAM)

    Returns:
        Path: The path to the sorted (and possibly indexed) file.
    """
    if path is not None:
        # Get the file type if a path was given (in this case, a file type may not be
        # provided too)
        if tmp_file_type is not None:
            raise ValueError("Both `path` and `tmp_file_type` cannot be provided.")
        tmp_file_type = sam.SamFileType.from_path(path)
    elif tmp_file_type is None:
        # Use the provided file type
        tmp_file_type = sam.SamFileType.BAM

    # Get the extension, and create a path if none was given
    ext = tmp_file_type.extension
    if path is None:
        with NamedTemporaryFile(suffix=ext, delete=False) as fp:
            path = Path(fp.name)

    with NamedTemporaryFile(suffix=ext, delete=True) as fp:
        file_handle: IO
        if self._sort_order in {SamOrder.Unsorted, SamOrder.Unknown}:
            file_handle = path.open("w")
        else:
            file_handle = fp.file

        with sam.writer(file_handle, header=self._samheader, file_type=tmp_file_type) as writer:
            for rec in self._records:
                if pred(rec):
                    writer.write(rec)

        samtools_sort_args = ["-o", str(path), fp.name]

        file_handle.close()
        if self._sort_order == SamOrder.QueryName:
            pysam.sort("-n", *samtools_sort_args)
        elif self._sort_order == SamOrder.Coordinate:
            if index and tmp_file_type.indexable:
                samtools_sort_args.insert(0, "--write-index")
            pysam.sort(*samtools_sort_args)

    return path
to_sorted_list
to_sorted_list() -> List[AlignedSegment]

Returns the accumulated records in coordinate order.

Source code in fgpyo/sam/builder.py
def to_sorted_list(self) -> List[pysam.AlignedSegment]:
    """Returns the accumulated records in coordinate order."""
    with NamedTemporaryFile(suffix=".bam", delete=True) as fp:
        filename = fp.name
        path = self.to_path(path=Path(filename), index=False)
        bam = sam.reader(path)
        return list(bam)
to_unsorted_list
to_unsorted_list() -> List[AlignedSegment]

Returns the accumulated records in the order they were created.

Source code in fgpyo/sam/builder.py
def to_unsorted_list(self) -> List[pysam.AlignedSegment]:
    """Returns the accumulated records in the order they were created."""
    return list(self._records)

clipping

Utility Functions for Soft-Clipping records in SAM/BAM Files

This module contains utility functions for soft-clipping reads. There are four variants that support clipping the beginnings and ends of reads, and specifying the amount to be clipped in terms of query bases or reference bases:

The difference between query and reference based versions is apparent only when there are insertions or deletions in the read as indels have lengths on either the query (insertions) or reference (deletions) but not both.

Upon clipping a set of additional SAM tags are removed from reads as they are likely invalid.

For example, to clip the last 10 query bases of all records and reduce the qualities to Q2:

>>> from fgpyo.sam import reader, clipping
>>> with reader("./tests/fgpyo/sam/data/valid.sam") as fh:
...     for rec in fh:
...         before = rec.cigarstring
...         info = clipping.softclip_end_of_alignment_by_query(rec, 10, 2)
...         after = rec.cigarstring
...         print(f"before: {before} after: {after} info: {info}")
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 10M1D10M5I76M after: 10M1D10M5I66M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: None after: None info: ClippingInfo(query_bases_clipped=0, ref_bases_clipped=0)

It should be noted that any clipping potentially makes the common SAM tags NM, MD and UQ invalid, as well as potentially other alignment based SAM tags. Any clipping added to the start of an alignment changes the position (reference_start) of the record. Any reads that have no aligned bases after clipping are set to be unmapped. If writing the clipped reads back to a BAM it should be noted that:

  • Mate pairs may have incorrect information about their mate's positions
  • Even if the input was coordinate sorted, the output may be out of order

To rectify these problems it is necessary to do the equivalent of:

cat clipped.bam | samtools sort -n | samtools fixmate | samtools sort | samtools calmd

Classes

ClippingInfo

Bases: NamedTuple

Named tuple holding the number of bases clipped on the query and reference respectively.

Source code in fgpyo/sam/clipping.py
class ClippingInfo(NamedTuple):
    """Named tuple holding the number of bases clipped on the query and reference respectively."""

    query_bases_clipped: int
    """The number of query bases in the alignment that were clipped."""

    ref_bases_clipped: int
    """The number of reference bases in the alignment that were clipped."""
Attributes
query_bases_clipped instance-attribute
query_bases_clipped: int

The number of query bases in the alignment that were clipped.

ref_bases_clipped instance-attribute
ref_bases_clipped: int

The number of reference bases in the alignment that were clipped.

Functions

softclip_end_of_alignment_by_query
softclip_end_of_alignment_by_query(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo

Adds soft-clipping to the end of a read's alignment.

Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters:

Name Type Description Default
rec AlignedSegment

the BAM record to clip

required
bases_to_clip int

the number of additional bases of clipping desired in the read/query

required
clipped_base_quality Optional[int]

if not None, set bases in the clipped region to this quality

None
tags_to_invalidate Iterable[str]

the set of extended attributes to remove upon clipping

TAGS_TO_INVALIDATE

Returns:

Name Type Description
ClippingInfo ClippingInfo

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Source code in fgpyo/sam/clipping.py
def softclip_end_of_alignment_by_query(
    rec: AlignedSegment,
    bases_to_clip: int,
    clipped_base_quality: Optional[int] = None,
    tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE,
) -> ClippingInfo:
    """
    Adds soft-clipping to the end of a read's alignment.

    Clipping is applied before any existing hard or soft clipping.  E.g. a read with cigar 100M5S
    that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

    If the read is unmapped or bases_to_clip < 1 then nothing is done.

    If the read has fewer clippable bases than requested the read will be unmapped.

    Args:
        rec: the BAM record to clip
        bases_to_clip: the number of additional bases of clipping desired in the read/query
        clipped_base_quality: if not None, set bases in the clipped region to this quality
        tags_to_invalidate: the set of extended attributes to remove upon clipping

    Returns:
        ClippingInfo: a named tuple containing the number of query/read bases and the number
            of target/reference bases clipped.
    """
    if rec.is_unmapped or bases_to_clip < 1:
        return ClippingInfo(0, 0)

    num_clippable_bases = rec.query_alignment_length

    if bases_to_clip >= num_clippable_bases:
        return _clip_whole_read(rec, tags_to_invalidate)

    # Reverse the cigar and qualities so we can clip from the start
    cigar = Cigar.from_cigartuples(rec.cigartuples).reversed()
    quals = rec.query_qualities
    quals.reverse()
    new_cigar, clipping_info = _clip(cigar, quals, bases_to_clip, clipped_base_quality)

    # Then reverse everything back again
    quals.reverse()
    rec.query_qualities = quals
    rec.cigarstring = str(new_cigar.reversed())

    _cleanup(rec, tags_to_invalidate)
    return clipping_info
softclip_end_of_alignment_by_ref
softclip_end_of_alignment_by_ref(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo

Soft-clips the end of an alignment by bases_to_clip bases on the reference.

Clipping is applied beforeany existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters:

Name Type Description Default
rec AlignedSegment

the BAM record to clip

required
bases_to_clip int

the number of additional bases of clipping desired on the reference

required
clipped_base_quality Optional[int]

if not None, set bases in the clipped region to this quality

None
tags_to_invalidate Iterable[str]

the set of extended attributes to remove upon clipping

TAGS_TO_INVALIDATE

Returns:

Name Type Description
ClippingInfo ClippingInfo

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Source code in fgpyo/sam/clipping.py
def softclip_end_of_alignment_by_ref(
    rec: AlignedSegment,
    bases_to_clip: int,
    clipped_base_quality: Optional[int] = None,
    tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE,
) -> ClippingInfo:
    """Soft-clips the end of an alignment by bases_to_clip bases on the reference.

    Clipping is applied beforeany existing hard or soft clipping.  E.g. a read with cigar 100M5S
    that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

    If the read is unmapped or bases_to_clip < 1 then nothing is done.

    If the read has fewer clippable bases than requested the read will be unmapped.

    Args:
        rec: the BAM record to clip
        bases_to_clip: the number of additional bases of clipping desired on the reference
        clipped_base_quality: if not None, set bases in the clipped region to this quality
        tags_to_invalidate: the set of extended attributes to remove upon clipping

    Returns:
        ClippingInfo: a named tuple containing the number of query/read bases and the number
            of target/reference bases clipped.
    """
    if rec.reference_length <= bases_to_clip:
        return _clip_whole_read(rec, tags_to_invalidate)

    new_end = rec.reference_end - bases_to_clip
    new_query_end = _read_pos_at_ref_pos(rec, new_end, previous=False)
    query_bases_to_clip = rec.query_alignment_end - new_query_end
    return softclip_end_of_alignment_by_query(
        rec, query_bases_to_clip, clipped_base_quality, tags_to_invalidate
    )
softclip_start_of_alignment_by_query
softclip_start_of_alignment_by_query(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo

Adds soft-clipping to the start of a read's alignment.

Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters:

Name Type Description Default
rec AlignedSegment

the BAM record to clip

required
bases_to_clip int

the number of additional bases of clipping desired in the read/query

required
clipped_base_quality Optional[int]

if not None, set bases in the clipped region to this quality

None
tags_to_invalidate Iterable[str]

the set of extended attributes to remove upon clipping

TAGS_TO_INVALIDATE

Returns:

Name Type Description
ClippingInfo ClippingInfo

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Source code in fgpyo/sam/clipping.py
def softclip_start_of_alignment_by_query(
    rec: AlignedSegment,
    bases_to_clip: int,
    clipped_base_quality: Optional[int] = None,
    tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE,
) -> ClippingInfo:
    """
    Adds soft-clipping to the start of a read's alignment.

    Clipping is applied after any existing hard or soft clipping.  E.g. a read with cigar 5S100M
    that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

    If the read is unmapped or bases_to_clip < 1 then nothing is done.

    If the read has fewer clippable bases than requested the read will be unmapped.

    Args:
        rec: the BAM record to clip
        bases_to_clip: the number of additional bases of clipping desired in the read/query
        clipped_base_quality: if not None, set bases in the clipped region to this quality
        tags_to_invalidate: the set of extended attributes to remove upon clipping

    Returns:
        ClippingInfo: a named tuple containing the number of query/read bases and the number
            of target/reference bases clipped.
    """
    if rec.is_unmapped or bases_to_clip < 1:
        return ClippingInfo(0, 0)

    num_clippable_bases = rec.query_alignment_length

    if bases_to_clip >= num_clippable_bases:
        return _clip_whole_read(rec, tags_to_invalidate)

    cigar = Cigar.from_cigartuples(rec.cigartuples)
    quals = rec.query_qualities
    new_cigar, clipping_info = _clip(cigar, quals, bases_to_clip, clipped_base_quality)
    rec.query_qualities = quals

    rec.reference_start += clipping_info.ref_bases_clipped
    rec.cigarstring = str(new_cigar)
    _cleanup(rec, tags_to_invalidate)
    return clipping_info
softclip_start_of_alignment_by_ref
softclip_start_of_alignment_by_ref(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo

Soft-clips the start of an alignment by bases_to_clip bases on the reference.

Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters:

Name Type Description Default
rec AlignedSegment

the BAM record to clip

required
bases_to_clip int

the number of additional bases of clipping desired on the reference

required
clipped_base_quality Optional[int]

if not None, set bases in the clipped region to this quality

None
tags_to_invalidate Iterable[str]

the set of extended attributes to remove upon clipping

TAGS_TO_INVALIDATE

Returns:

Name Type Description
ClippingInfo ClippingInfo

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Source code in fgpyo/sam/clipping.py
def softclip_start_of_alignment_by_ref(
    rec: AlignedSegment,
    bases_to_clip: int,
    clipped_base_quality: Optional[int] = None,
    tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE,
) -> ClippingInfo:
    """Soft-clips the start of an alignment by bases_to_clip bases on the reference.

    Clipping is applied after any existing hard or soft clipping.  E.g. a read with cigar 5S100M
    that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

    If the read is unmapped or bases_to_clip < 1 then nothing is done.

    If the read has fewer clippable bases than requested the read will be unmapped.

    Args:
        rec: the BAM record to clip
        bases_to_clip: the number of additional bases of clipping desired on the reference
        clipped_base_quality: if not None, set bases in the clipped region to this quality
        tags_to_invalidate: the set of extended attributes to remove upon clipping

    Returns:
        ClippingInfo: a named tuple containing the number of query/read bases and the number
            of target/reference bases clipped.
    """
    if rec.reference_length <= bases_to_clip:
        return _clip_whole_read(rec, tags_to_invalidate)

    new_start = rec.reference_start + bases_to_clip
    new_query_start = _read_pos_at_ref_pos(rec, new_start, previous=False)
    query_bases_to_clip = new_query_start - rec.query_alignment_start
    return softclip_start_of_alignment_by_query(
        rec, query_bases_to_clip, clipped_base_quality, tags_to_invalidate
    )