clipping
Utility Functions for Soft-Clipping records in SAM/BAM Files¶
This module contains utility functions for soft-clipping reads. There are four variants that support clipping the beginnings and ends of reads, and specifying the amount to be clipped in terms of query bases or reference bases:
softclip_start_of_alignment_by_query()clips the start of the alignment in terms of query basessoftclip_end_of_alignment_by_query()clips the end of the alignment in terms of query basessoftclip_start_of_alignment_by_ref()clips the start of the alignment in terms of reference basessoftclip_end_of_alignment_by_ref()clips the end of the alignment in terms of reference bases
The difference between query and reference based versions is apparent only when there are insertions or deletions in the read as indels have lengths on either the query (insertions) or reference (deletions) but not both.
Upon clipping a set of additional SAM tags are removed from reads as they are likely invalid.
For example, to clip the last 10 query bases of all records and reduce the qualities to Q2:
>>> from fgpyo.sam import reader, clipping
>>> with reader("./tests/fgpyo/sam/data/valid.sam") as fh:
... for rec in fh:
... before = rec.cigarstring
... info = clipping.softclip_end_of_alignment_by_query(rec, 10, 2)
... after = rec.cigarstring
... print(f"before: {before} after: {after} info: {info}")
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 10M1D10M5I76M after: 10M1D10M5I66M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: None after: None info: ClippingInfo(query_bases_clipped=0, ref_bases_clipped=0)
It should be noted that any clipping potentially makes the common SAM tags NM, MD and UQ invalid, as well as potentially other alignment based SAM tags. Any clipping added to the start of an alignment changes the position (reference_start) of the record. Any reads that have no aligned bases after clipping are set to be unmapped. If writing the clipped reads back to a BAM it should be noted that:
- Mate pairs may have incorrect information about their mate's positions
- Even if the input was coordinate sorted, the output may be out of order
To rectify these problems it is necessary to do the equivalent of:
Classes¶
ClippingInfo ¶
Bases: NamedTuple
Named tuple holding the number of bases clipped on the query and reference respectively.
Source code in fgpyo/sam/clipping.py
Functions¶
softclip_end_of_alignment_by_query ¶
softclip_end_of_alignment_by_query(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo
Adds soft-clipping to the end of a read's alignment.
Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
the BAM record to clip |
required |
bases_to_clip
|
int
|
the number of additional bases of clipping desired in the read/query |
required |
clipped_base_quality
|
Optional[int]
|
if not None, set bases in the clipped region to this quality |
None
|
tags_to_invalidate
|
Iterable[str]
|
the set of extended attributes to remove upon clipping |
TAGS_TO_INVALIDATE
|
Returns:
| Name | Type | Description |
|---|---|---|
ClippingInfo |
ClippingInfo
|
a named tuple containing the number of query/read bases and the number of target/reference bases clipped. |
Source code in fgpyo/sam/clipping.py
softclip_end_of_alignment_by_ref ¶
softclip_end_of_alignment_by_ref(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo
Soft-clips the end of an alignment by bases_to_clip bases on the reference.
Clipping is applied beforeany existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
the BAM record to clip |
required |
bases_to_clip
|
int
|
the number of additional bases of clipping desired on the reference |
required |
clipped_base_quality
|
Optional[int]
|
if not None, set bases in the clipped region to this quality |
None
|
tags_to_invalidate
|
Iterable[str]
|
the set of extended attributes to remove upon clipping |
TAGS_TO_INVALIDATE
|
Returns:
| Name | Type | Description |
|---|---|---|
ClippingInfo |
ClippingInfo
|
a named tuple containing the number of query/read bases and the number of target/reference bases clipped. |
Source code in fgpyo/sam/clipping.py
softclip_start_of_alignment_by_query ¶
softclip_start_of_alignment_by_query(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo
Adds soft-clipping to the start of a read's alignment.
Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
the BAM record to clip |
required |
bases_to_clip
|
int
|
the number of additional bases of clipping desired in the read/query |
required |
clipped_base_quality
|
Optional[int]
|
if not None, set bases in the clipped region to this quality |
None
|
tags_to_invalidate
|
Iterable[str]
|
the set of extended attributes to remove upon clipping |
TAGS_TO_INVALIDATE
|
Returns:
| Name | Type | Description |
|---|---|---|
ClippingInfo |
ClippingInfo
|
a named tuple containing the number of query/read bases and the number of target/reference bases clipped. |
Source code in fgpyo/sam/clipping.py
softclip_start_of_alignment_by_ref ¶
softclip_start_of_alignment_by_ref(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo
Soft-clips the start of an alignment by bases_to_clip bases on the reference.
Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
the BAM record to clip |
required |
bases_to_clip
|
int
|
the number of additional bases of clipping desired on the reference |
required |
clipped_base_quality
|
Optional[int]
|
if not None, set bases in the clipped region to this quality |
None
|
tags_to_invalidate
|
Iterable[str]
|
the set of extended attributes to remove upon clipping |
TAGS_TO_INVALIDATE
|
Returns:
| Name | Type | Description |
|---|---|---|
ClippingInfo |
ClippingInfo
|
a named tuple containing the number of query/read bases and the number of target/reference bases clipped. |