Skip to content

clipping

Utility Functions for Soft-Clipping records in SAM/BAM Files

This module contains utility functions for soft-clipping reads. There are four variants that support clipping the beginnings and ends of reads, and specifying the amount to be clipped in terms of query bases or reference bases:

The difference between query and reference based versions is apparent only when there are insertions or deletions in the read as indels have lengths on either the query (insertions) or reference (deletions) but not both.

Upon clipping a set of additional SAM tags are removed from reads as they are likely invalid.

For example, to clip the last 10 query bases of all records and reduce the qualities to Q2:

>>> from fgpyo.sam import reader, clipping
>>> with reader("./tests/fgpyo/sam/data/valid.sam") as fh:
...     for rec in fh:
...         before = rec.cigarstring
...         info = clipping.softclip_end_of_alignment_by_query(rec, 10, 2)
...         after = rec.cigarstring
...         print(f"before: {before} after: {after} info: {info}")
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 10M1D10M5I76M after: 10M1D10M5I66M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: None after: None info: ClippingInfo(query_bases_clipped=0, ref_bases_clipped=0)

It should be noted that any clipping potentially makes the common SAM tags NM, MD and UQ invalid, as well as potentially other alignment based SAM tags. Any clipping added to the start of an alignment changes the position (reference_start) of the record. Any reads that have no aligned bases after clipping are set to be unmapped. If writing the clipped reads back to a BAM it should be noted that:

  • Mate pairs may have incorrect information about their mate's positions
  • Even if the input was coordinate sorted, the output may be out of order

To rectify these problems it is necessary to do the equivalent of:

cat clipped.bam | samtools sort -n | samtools fixmate | samtools sort | samtools calmd

Classes

ClippingInfo

Bases: NamedTuple

Named tuple holding the number of bases clipped on the query and reference respectively.

Source code in fgpyo/sam/clipping.py
class ClippingInfo(NamedTuple):
    """Named tuple holding the number of bases clipped on the query and reference respectively."""

    query_bases_clipped: int
    """The number of query bases in the alignment that were clipped."""

    ref_bases_clipped: int
    """The number of reference bases in the alignment that were clipped."""

Attributes

query_bases_clipped instance-attribute
query_bases_clipped: int

The number of query bases in the alignment that were clipped.

ref_bases_clipped instance-attribute
ref_bases_clipped: int

The number of reference bases in the alignment that were clipped.

Functions

softclip_end_of_alignment_by_query

softclip_end_of_alignment_by_query(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo

Adds soft-clipping to the end of a read's alignment.

Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters:

Name Type Description Default
rec AlignedSegment

the BAM record to clip

required
bases_to_clip int

the number of additional bases of clipping desired in the read/query

required
clipped_base_quality Optional[int]

if not None, set bases in the clipped region to this quality

None
tags_to_invalidate Iterable[str]

the set of extended attributes to remove upon clipping

TAGS_TO_INVALIDATE

Returns:

Name Type Description
ClippingInfo ClippingInfo

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Source code in fgpyo/sam/clipping.py
def softclip_end_of_alignment_by_query(
    rec: AlignedSegment,
    bases_to_clip: int,
    clipped_base_quality: Optional[int] = None,
    tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE,
) -> ClippingInfo:
    """
    Adds soft-clipping to the end of a read's alignment.

    Clipping is applied before any existing hard or soft clipping.  E.g. a read with cigar 100M5S
    that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

    If the read is unmapped or bases_to_clip < 1 then nothing is done.

    If the read has fewer clippable bases than requested the read will be unmapped.

    Args:
        rec: the BAM record to clip
        bases_to_clip: the number of additional bases of clipping desired in the read/query
        clipped_base_quality: if not None, set bases in the clipped region to this quality
        tags_to_invalidate: the set of extended attributes to remove upon clipping

    Returns:
        ClippingInfo: a named tuple containing the number of query/read bases and the number
            of target/reference bases clipped.
    """
    if rec.is_unmapped or bases_to_clip < 1:
        return ClippingInfo(0, 0)

    num_clippable_bases = rec.query_alignment_length

    if bases_to_clip >= num_clippable_bases:
        return _clip_whole_read(rec, tags_to_invalidate)

    # Reverse the cigar and qualities so we can clip from the start
    cigar = Cigar.from_cigartuples(rec.cigartuples).reversed()
    quals = rec.query_qualities
    quals.reverse()
    new_cigar, clipping_info = _clip(cigar, quals, bases_to_clip, clipped_base_quality)

    # Then reverse everything back again
    quals.reverse()
    rec.query_qualities = quals
    rec.cigarstring = str(new_cigar.reversed())

    _cleanup(rec, tags_to_invalidate)
    return clipping_info

softclip_end_of_alignment_by_ref

softclip_end_of_alignment_by_ref(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo

Soft-clips the end of an alignment by bases_to_clip bases on the reference.

Clipping is applied beforeany existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters:

Name Type Description Default
rec AlignedSegment

the BAM record to clip

required
bases_to_clip int

the number of additional bases of clipping desired on the reference

required
clipped_base_quality Optional[int]

if not None, set bases in the clipped region to this quality

None
tags_to_invalidate Iterable[str]

the set of extended attributes to remove upon clipping

TAGS_TO_INVALIDATE

Returns:

Name Type Description
ClippingInfo ClippingInfo

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Source code in fgpyo/sam/clipping.py
def softclip_end_of_alignment_by_ref(
    rec: AlignedSegment,
    bases_to_clip: int,
    clipped_base_quality: Optional[int] = None,
    tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE,
) -> ClippingInfo:
    """Soft-clips the end of an alignment by bases_to_clip bases on the reference.

    Clipping is applied beforeany existing hard or soft clipping.  E.g. a read with cigar 100M5S
    that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

    If the read is unmapped or bases_to_clip < 1 then nothing is done.

    If the read has fewer clippable bases than requested the read will be unmapped.

    Args:
        rec: the BAM record to clip
        bases_to_clip: the number of additional bases of clipping desired on the reference
        clipped_base_quality: if not None, set bases in the clipped region to this quality
        tags_to_invalidate: the set of extended attributes to remove upon clipping

    Returns:
        ClippingInfo: a named tuple containing the number of query/read bases and the number
            of target/reference bases clipped.
    """
    if rec.reference_length <= bases_to_clip:
        return _clip_whole_read(rec, tags_to_invalidate)

    new_end = rec.reference_end - bases_to_clip
    new_query_end = _read_pos_at_ref_pos(rec, new_end, previous=False)
    query_bases_to_clip = rec.query_alignment_end - new_query_end
    return softclip_end_of_alignment_by_query(
        rec, query_bases_to_clip, clipped_base_quality, tags_to_invalidate
    )

softclip_start_of_alignment_by_query

softclip_start_of_alignment_by_query(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo

Adds soft-clipping to the start of a read's alignment.

Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters:

Name Type Description Default
rec AlignedSegment

the BAM record to clip

required
bases_to_clip int

the number of additional bases of clipping desired in the read/query

required
clipped_base_quality Optional[int]

if not None, set bases in the clipped region to this quality

None
tags_to_invalidate Iterable[str]

the set of extended attributes to remove upon clipping

TAGS_TO_INVALIDATE

Returns:

Name Type Description
ClippingInfo ClippingInfo

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Source code in fgpyo/sam/clipping.py
def softclip_start_of_alignment_by_query(
    rec: AlignedSegment,
    bases_to_clip: int,
    clipped_base_quality: Optional[int] = None,
    tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE,
) -> ClippingInfo:
    """
    Adds soft-clipping to the start of a read's alignment.

    Clipping is applied after any existing hard or soft clipping.  E.g. a read with cigar 5S100M
    that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

    If the read is unmapped or bases_to_clip < 1 then nothing is done.

    If the read has fewer clippable bases than requested the read will be unmapped.

    Args:
        rec: the BAM record to clip
        bases_to_clip: the number of additional bases of clipping desired in the read/query
        clipped_base_quality: if not None, set bases in the clipped region to this quality
        tags_to_invalidate: the set of extended attributes to remove upon clipping

    Returns:
        ClippingInfo: a named tuple containing the number of query/read bases and the number
            of target/reference bases clipped.
    """
    if rec.is_unmapped or bases_to_clip < 1:
        return ClippingInfo(0, 0)

    num_clippable_bases = rec.query_alignment_length

    if bases_to_clip >= num_clippable_bases:
        return _clip_whole_read(rec, tags_to_invalidate)

    cigar = Cigar.from_cigartuples(rec.cigartuples)
    quals = rec.query_qualities
    new_cigar, clipping_info = _clip(cigar, quals, bases_to_clip, clipped_base_quality)
    rec.query_qualities = quals

    rec.reference_start += clipping_info.ref_bases_clipped
    rec.cigarstring = str(new_cigar)
    _cleanup(rec, tags_to_invalidate)
    return clipping_info

softclip_start_of_alignment_by_ref

softclip_start_of_alignment_by_ref(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo

Soft-clips the start of an alignment by bases_to_clip bases on the reference.

Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters:

Name Type Description Default
rec AlignedSegment

the BAM record to clip

required
bases_to_clip int

the number of additional bases of clipping desired on the reference

required
clipped_base_quality Optional[int]

if not None, set bases in the clipped region to this quality

None
tags_to_invalidate Iterable[str]

the set of extended attributes to remove upon clipping

TAGS_TO_INVALIDATE

Returns:

Name Type Description
ClippingInfo ClippingInfo

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Source code in fgpyo/sam/clipping.py
def softclip_start_of_alignment_by_ref(
    rec: AlignedSegment,
    bases_to_clip: int,
    clipped_base_quality: Optional[int] = None,
    tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE,
) -> ClippingInfo:
    """Soft-clips the start of an alignment by bases_to_clip bases on the reference.

    Clipping is applied after any existing hard or soft clipping.  E.g. a read with cigar 5S100M
    that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

    If the read is unmapped or bases_to_clip < 1 then nothing is done.

    If the read has fewer clippable bases than requested the read will be unmapped.

    Args:
        rec: the BAM record to clip
        bases_to_clip: the number of additional bases of clipping desired on the reference
        clipped_base_quality: if not None, set bases in the clipped region to this quality
        tags_to_invalidate: the set of extended attributes to remove upon clipping

    Returns:
        ClippingInfo: a named tuple containing the number of query/read bases and the number
            of target/reference bases clipped.
    """
    if rec.reference_length <= bases_to_clip:
        return _clip_whole_read(rec, tags_to_invalidate)

    new_start = rec.reference_start + bases_to_clip
    new_query_start = _read_pos_at_ref_pos(rec, new_start, previous=False)
    query_bases_to_clip = new_query_start - rec.query_alignment_start
    return softclip_start_of_alignment_by_query(
        rec, query_bases_to_clip, clipped_base_quality, tags_to_invalidate
    )