fgpyo
Classes¶
RequirementError ¶
Functions¶
require ¶
Require a condition be satisfied.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
condition
|
bool
|
The condition to satisfy. |
required |
message
|
Union[str, Callable[[], str], None]
|
An optional message to include with the error when the condition is false. The message may be provided as either a string literal or a function returning a string. The function will not be evaluated unless the condition is false. |
None
|
Raises:
| Type | Description |
|---|---|
RequirementError
|
If the condition is false. |
Source code in fgpyo/_requirements.py
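For example, a minimal sketch (assuming both require and RequirementError are importable from the top-level fgpyo package, as documented here):
>>> from fgpyo import RequirementError, require
>>> require(2 + 2 == 4)
>>> try:
...     require(1 > 2, lambda: "message built only on failure")
... except RequirementError as ex:
...     print(ex)
message built only on failure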
Modules¶
collections ¶
Custom Collections and Collection Functions¶
This module contains classes and functions for working with collections and iterators.
Helpful Functions for Working with Collections¶
To test if an iterable is sorted or not:
>>> from fgpyo.collections import is_sorted
>>> is_sorted([])
True
>>> is_sorted([1])
True
>>> is_sorted([1, 2, 2, 3])
True
>>> is_sorted([1, 2, 4, 3])
False
Examples of a "Peekable" Iterator¶
"Peekable" iterators are useful to "peek" at the next item in an iterator without consuming it.
For example, this is useful when consuming items from an iterator while a predicate is true, without
consuming the first element for which the predicate is false. See the
takewhile() and
dropwhile() methods.
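For example, a short sketch of takewhile() and dropwhile() (the values and predicates below are illustrative):
>>> from fgpyo.collections import PeekableIterator
>>> piter = PeekableIterator([1, 2, 3, 4, 1])
>>> piter.takewhile(lambda x: x < 3)
[1, 2]
>>> _ = piter.dropwhile(lambda x: x == 3)  # returns the iterator itself, so calls can be chained
>>> list(piter)
[4, 1]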
An empty peekable iterator raises a StopIteration when peeked:
>>> from fgpyo.collections import PeekableIterator
>>> piter = PeekableIterator(iter([]))
>>> piter.peek()
Traceback (most recent call last):
...
StopIteration
A peekable iterator will return the next item before consuming it.
>>> piter = PeekableIterator([1, 2, 3])
>>> piter.peek()
1
>>> next(piter)
1
>>> [j for j in piter]
[2, 3]
The can_peek() method can be used to determine whether the iterator can be peeked without raising a
StopIteration:
>>> piter = PeekableIterator([1])
>>> piter.peek() if piter.can_peek() else -1
1
>>> next(piter)
1
>>> piter.peek() if piter.can_peek() else -1
-1
>>> next(piter)
Traceback (most recent call last):
...
StopIteration
PeekableIterator's constructor supports creation from
iterable objects as well as iterators.
Attributes¶
LessThanOrEqualType
module-attribute
¶
LessThanOrEqualType = TypeVar('LessThanOrEqualType', bound=SupportsLessThanOrEqual)
A type variable for an object that supports less-than-or-equal comparisons.
Classes¶
PeekableIterator ¶
Bases: Generic[IterType], Iterator[IterType]
A peekable iterator wrapping an iterator or iterable.
This allows returning the next item without consuming it.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
source
|
Union[Iterator[IterType], Iterable[IterType]]
|
an iterator over the objects |
required |
Source code in fgpyo/collections/__init__.py
Functions¶
dropwhile(pred: Callable[[IterType], bool]) -> PeekableIterator[IterType]
Drops elements from the iterator while the predicate is true.
Updates the iterator to point at the first non-matching element, or exhausts the iterator if all elements match the predicate.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
pred
|
Callable[[V], bool]
|
a function that takes a value from the iterator and returns true or false. |
required |
Returns:
| Type | Description |
|---|---|
PeekableIterator[IterType]
|
PeekableIterator[V]: a reference to this iterator, so calls can be chained |
Source code in fgpyo/collections/__init__.py
peek() -> IterType
Returns the next element without consuming it, or raises StopIteration if the iterator is exhausted.
takewhile(pred: Callable[[IterType], bool]) -> List[IterType]
Consumes from the iterator while pred is true, and returns the result as a List.
The iterator is left pointing at the first non-matching item, or if all items match then the iterator will be exhausted.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
pred
|
Callable[[IterType], bool]
|
a function that takes the next value from the iterator and returns true or false. |
required |
Returns:
| Type | Description |
|---|---|
List[IterType]
|
List[V]: A list of the values from the iterator, in order, up until and excluding the first value that does not match the predicate. |
Source code in fgpyo/collections/__init__.py
SupportsLessThanOrEqual ¶
Bases: Protocol
A structural type for objects that support less-than-or-equal comparison.
Source code in fgpyo/collections/__init__.py
Functions¶
is_sorted ¶
is_sorted(iterable: Iterable[LessThanOrEqualType]) -> bool
Tests lazily if an iterable of comparable objects is sorted or not.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
iterable
|
Iterable[LessThanOrEqualType]
|
An iterable of comparable objects. |
required |
Raises:
| Type | Description |
|---|---|
TypeError
|
If there is more than 1 element in the iterable and the elements do not support less-than-or-equal comparisons. |
Source code in fgpyo/collections/__init__.py
fasta ¶
Modules¶
builder ¶
Classes for generating fasta files and records for testing¶
This module contains utility classes for creating fasta files, indexed fasta files (.fai), and sequence dictionaries (.dict).
Examples of creating sets of contigs for writing to fasta¶
Writing a FASTA with two contigs each with 100 bases:
>>> from pathlib import Path
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> builder.add("chr10").add("AAAAAAAAAA", 10)
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> builder = builder.add("chr11").add("GGGGGGGGGG", 10)
>>> fasta_path = Path(getfixture("tmp_path")) / "test.fasta"
>>> builder.to_file(path=fasta_path)
Writing a FASTA with one contig with 100 A's and 50 T's:
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> builder.add("chr10").add("AAAAAAAAAA", 10).add("TTTTTTTTTT", 5)
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> builder.to_file(path=fasta_path)
Add bases to existing contig:
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> contig_one = builder.add("chr10").add("AAAAAAAAAA", 1)
>>> contig_one.add("NNN", 1)
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> contig_one.bases
'AAAAAAAAAANNN'
Classes¶
ContigBuilder ¶
Builder for constructing new contigs and adding bases to existing contigs. Existing contigs cannot be overwritten; each contig name in FastaBuilder must be unique. Instances of ContigBuilder should be created using FastaBuilder.add(), where species and assembly are optional parameters and will default to FastaBuilder.species and FastaBuilder.assembly, respectively.
Attributes:
| Name | Type | Description |
|---|---|---|
name |
Unique contig ID, e.g. "chr10" |
|
assembly |
Assembly information, if None default is 'testassembly' |
|
species |
Species information, if None default is 'testspecies' |
|
bases |
The bases to be added to the contig, e.g. "A" |
Source code in fgpyo/fasta/builder.py
add(bases: str, times: int = 1) -> ContigBuilder
Method for adding bases to a new or existing instance of ContigBuilder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
bases
|
str
|
The bases to be added to the contig |
required |
times
|
int
|
The number of times the bases should be repeated |
1
|
Example add("AAA", 2) results in the following bases -> "AAAAAA"
Source code in fgpyo/fasta/builder.py
FastaBuilder ¶
Builder for constructing sets of one or more contigs.
Provides the ability to manufacture sets of contigs from minimal input, and automatically generates the information necessary for writing the FASTA file, index, and dictionary.
A builder is constructed from an assembly, species, and line length. All attributes have defaults; however, these can be overridden.
Contigs are added to FastaBuilder using:
add()
Bases are added to existing contigs using:
add()
Once accumulated the contigs can be written to a file using:
to_file()
Calling to_file() will also generate the fasta index (.fai) and sequence dictionary (.dict).
Attributes:
| Name | Type | Description |
|---|---|---|
assembly |
str
|
Assembly information, if None default is 'testassembly' |
species |
str
|
Species, if None default is 'testspecies' |
line_length |
int
|
Desired line length, if None default is 80 |
contig_builders |
Dict[str, ContigBuilder]
|
Private dictionary of contig names and instances of ContigBuilder |
Source code in fgpyo/fasta/builder.py
__getitem__(key: str) -> ContigBuilder
add(name: str, assembly: Optional[str] = None, species: Optional[str] = None) -> ContigBuilder
Creates and returns a new ContigBuilder for a contig with the provided name. Contig names must be unique; attempting to create two separate contigs with the same name will result in an error.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
Unique contig ID, e.g. "chr10" |
required |
assembly
|
Optional[str]
|
Assembly information, if None default is 'testassembly' |
None
|
species
|
Optional[str]
|
Species information, if None default is 'testspecies' |
None
|
Source code in fgpyo/fasta/builder.py
Writes out the set of accumulated contigs to a FASTA file at the path given.
Also generates the accompanying fasta index file (.fa.fai) and sequence
dictionary file (.dict).
Contigs are emitted in the order they were added to the builder. Sequence lines in the FASTA file are wrapped to the line length given when the builder was constructed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
Path to write files to. |
required |
Example: FastaBuilder.to_file(path = pathlib.Path("my_fasta.fa"))
Source code in fgpyo/fasta/builder.py
Functions¶
Calls pysam.dict and writes the sequence dictionary to the provided output path.
Args:
assembly: Assembly
species: Species
output_path: File path to write the dictionary to
input_path: Path to fasta file
Source code in fgpyo/fasta/builder.py
Calls pysam.faidx and writes the fasta index in the same file location as the fasta file.
Args:
input_path: Path to fasta file
sequence_dictionary ¶
Classes for representing sequencing dictionaries.¶
Examples of building and using sequence dictionaries¶
Building a sequence dictionary from a pysam.AlignmentHeader:
>>> import pysam
>>> from fgpyo.fasta.sequence_dictionary import SequenceDictionary
>>> sd: SequenceDictionary
>>> with pysam.AlignmentFile("./tests/fgpyo/sam/data/valid.sam") as fh:
... sd = SequenceDictionary.from_sam(fh.header)
>>> print(sd)
@SQ SN:chr1 LN:101
@SQ SN:chr2 LN:101
@SQ SN:chr3 LN:101
@SQ SN:chr4 LN:101
@SQ SN:chr5 LN:101
@SQ SN:chr6 LN:101
@SQ SN:chr7 LN:404
@SQ SN:chr8 LN:202
Query based on index:
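>>> print(sd[1])
@SQ SN:chr2 LN:101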
Query based on name:
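>>> print(sd["chr2"])
@SQ SN:chr2 LN:101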
Add, get, and delete attributes:
>>> from fgpyo.fasta.sequence_dictionary import Keys
>>> meta = sd[0]
>>> print(meta)
@SQ SN:chr1 LN:101
>>> meta[Keys.ASSEMBLY] = "hg38"
>>> print(meta)
@SQ SN:chr1 LN:101 AS:hg38
>>> meta.get(Keys.ASSEMBLY)
'hg38'
>>> meta.get(Keys.SPECIES) is None
True
>>> Keys.MD5 in meta
False
>>> del meta[Keys.ASSEMBLY]
>>> print(meta)
@SQ SN:chr1 LN:101
Get a sequence based on one of its aliases
>>> meta[Keys.ALIASES] = "foo,bar,car"
>>> sd = SequenceDictionary(infos=[meta] + sd.infos[1:])
>>> print(sd)
@SQ SN:chr1 LN:101 AN:foo,bar,car
@SQ SN:chr2 LN:101
@SQ SN:chr3 LN:101
@SQ SN:chr4 LN:101
@SQ SN:chr5 LN:101
@SQ SN:chr6 LN:101
@SQ SN:chr7 LN:404
@SQ SN:chr8 LN:202
>>> print(sd["chr1"])
@SQ SN:chr1 LN:101 AN:foo,bar,car
>>> print(sd["bar"])
@SQ SN:chr1 LN:101 AN:foo,bar,car
Create a pysam.AlignmentHeader from a sequence dictionary:
>>> sd.to_sam_header()
<pysam.libcalignmentfile.AlignmentHeader object at ...>
>>> print(sd.to_sam_header())
@HD VN:1.5
@SQ SN:chr1 LN:101 AN:foo,bar,car
@SQ SN:chr2 LN:101
@SQ SN:chr3 LN:101
@SQ SN:chr4 LN:101
@SQ SN:chr5 LN:101
@SQ SN:chr6 LN:101
@SQ SN:chr7 LN:404
@SQ SN:chr8 LN:202
Create a pysam.AlignmentHeader from a sequence dictionary with extra header items:
>>> sd.to_sam_header(
... extra_header={"RG": [{"ID": "A", "LB": "a-library"}, {"ID": "B", "LB": "b-library"}]}
... )
<pysam.libcalignmentfile.AlignmentHeader object at ...>
>>> print(sd.to_sam_header(
... extra_header={"RG": [{"ID": "A", "LB": "a-library"}, {"ID": "B", "LB": "b-library"}]}
... ))
@HD VN:1.5
@SQ SN:chr1 LN:101 AN:foo,bar,car
@SQ SN:chr2 LN:101
@SQ SN:chr3 LN:101
@SQ SN:chr4 LN:101
@SQ SN:chr5 LN:101
@SQ SN:chr6 LN:101
@SQ SN:chr7 LN:404
@SQ SN:chr8 LN:202
@RG ID:A LB:a-library
@RG ID:B LB:b-library
Attributes¶
SEQUENCE_NAME_PATTERN
module-attribute
¶
SEQUENCE_NAME_PATTERN: Pattern = compile('^[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*$')
Regular expression for valid reference sequence names according to the SAM spec
Classes¶
AlternateLocus ¶
dataclass
¶
Stores an alternate locus for an associated sequence (1-based, inclusive).
Source code in fgpyo/fasta/sequence_dictionary.py
Any post initialization validation should go here
Source code in fgpyo/fasta/sequence_dictionary.py
staticmethod
¶
parse(value: str) -> AlternateLocus
Parse the genomic interval of format: <contig>:<start>-<end>
Source code in fgpyo/fasta/sequence_dictionary.py
Keys ¶
Bases: StrEnum
Enumeration of tags/attributes available on a sequence record/metadata (SAM @SQ line).
Source code in fgpyo/fasta/sequence_dictionary.py
SequenceDictionary ¶
dataclass
¶
Bases: Mapping[Union[str, int], SequenceMetadata]
Contains an ordered collection of sequences.
A specific SequenceMetadata may be retrieved by name (str) or index (int), either by
using the generic get method or by the correspondingly named by_name and by_index methods.
The latter methods provide faster retrieval when the type is known.
This mapping collection iterates over the keys. To iterate over each SequenceMetadata,
either use the typical values() method or access the metadata directly with infos.
Attributes:
| Name | Type | Description |
|---|---|---|
infos |
List[SequenceMetadata]
|
the ordered collection of sequence metadata |
Source code in fgpyo/fasta/sequence_dictionary.py
by_index(index: int) -> SequenceMetadata
Gets a SequenceMetadata explicitly by index. Raises an IndexError
if the index is out of bounds.
by_name(name: str) -> SequenceMetadata
staticmethod
¶
from_sam(data: Path) -> SequenceDictionary
from_sam(data: AlignmentFile) -> SequenceDictionary
from_sam(data: AlignmentHeader) -> SequenceDictionary
from_sam(data: List[Dict[str, Any]]) -> SequenceDictionary
from_sam(data: Union[Path, AlignmentFile, AlignmentHeader, List[Dict[str, Any]]]) -> SequenceDictionary
Creates a SequenceDictionary from a SAM file or its header.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
Union[Path, AlignmentFile, AlignmentHeader, List[Dict[str, Any]]]
|
The input may be any of: a path to a SAM file, an open pysam.AlignmentFile, a pysam.AlignmentHeader, or a list of dictionaries corresponding to the header's @SQ records. |
required |
Returns:
A SequenceDictionary mapping reference names to their metadata.
Source code in fgpyo/fasta/sequence_dictionary.py
get_by_name(name: str) -> Optional[SequenceMetadata]
Gets a SequenceMetadata explicitly by name. Returns None if
the name does not exist in this dictionary
same_as(other: SequenceDictionary) -> bool
Returns true if the sequences share a common reference name (including aliases), have the same length, and the same MD5 if both have MD5s
Source code in fgpyo/fasta/sequence_dictionary.py
Converts the sequence dictionary to a pysam.AlignmentHeader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
extra_header
|
Optional[Dict[str, Any]]
|
a dictionary of extra values to add to the header, None otherwise. See
|
None
|
Source code in fgpyo/fasta/sequence_dictionary.py
SequenceMetadata ¶
dataclass
¶
Bases: MutableMapping[Union[Keys, str], str]
Stores information about a single Sequence (ex. chromosome, contig).
Implements the mutable mapping interface, which provides access to the attributes of this
sequence, including name, length, but not index. When using the mapping interface, for example
getting, setting, deleting, as well as iterating over keys, values, and items, the values will
always be strings (str type). For example, the length will be a str when accessed via
get; access the length attribute directly or use len to return an int. Similarly, use the
alias property to return a List[str] of aliases, the alternate property to return
an AlternateLocus-typed instance, and the topology property to return a Topology-typed
instance.
All attributes except name and length may be set. Use dataclasses.replace to create a new
copy in such cases.
Important: The len method returns the length of the sequence, not the length of the
attributes. Use len(meta.attributes) for the latter.
Attributes:
| Name | Type | Description |
|---|---|---|
name |
str
|
the primary name of the sequence |
length |
int
|
the length of the sequence, or zero if unknown |
index |
int
|
the index in the sequence dictionary |
attributes |
Dict[Union[Keys, str], str]
|
attributes of this sequence |
Source code in fgpyo/fasta/sequence_dictionary.py
property
¶
A list of all names, including the primary name and aliases, in that order.
property
¶
True if there is an alternate locus defined, False otherwise.
Any post initialization validation should go here
Source code in fgpyo/fasta/sequence_dictionary.py
staticmethod
¶
from_sam(meta: Dict[Union[Keys, str], Any], index: int) -> SequenceMetadata
Builds a SequenceMetadata from a dictionary. The keys must include the sequence
name (Keys.SEQUENCE_NAME) and length (Keys.SEQUENCE_LENGTH). All other keys from
Keys will be stored in the resulting attributes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
meta
|
Dict[Union[Keys, str], Any]
|
the python dictionary with keys from Keys. |
required |
index
|
int
|
the 0-based index to use for this sequence |
required |
Source code in fgpyo/fasta/sequence_dictionary.py
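For example, a minimal sketch (the str rendering is assumed to match the formatting of the examples above):
>>> from fgpyo.fasta.sequence_dictionary import Keys, SequenceMetadata
>>> meta = SequenceMetadata.from_sam({Keys.SEQUENCE_NAME: "chr1", Keys.SEQUENCE_LENGTH: 100}, index=0)
>>> print(meta)
@SQ SN:chr1 LN:100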
same_as(other: SequenceMetadata) -> bool
Returns true if the sequences share a common reference name (including aliases), have the same length, and the same MD5 if both have MD5s.
Source code in fgpyo/fasta/sequence_dictionary.py
Converts the sequence metadata to a dictionary equivalent to one item in the
list of sequences from pysam.AlignmentHeader#to_dict()["SQ"].
Source code in fgpyo/fasta/sequence_dictionary.py
Modules¶
fastx ¶
Zipping FASTX Files¶
Zipping a set of FASTA/FASTQ files into a single stream of data is a common task in bioinformatics
and can be achieved with the FastxZipped() context manager.
The context manager facilitates opening of all input FASTA/FASTQ files and closing them after
iteration is complete. For every iteration of FastxZipped(),
a tuple of the next FASTX records is returned (of type
pysam.FastxRecord). An exception will be raised if any of the input
files is malformed or truncated, or if the record names are not in sync across the files.
Importantly, this context manager is optimized for fast streaming read-only usage and, by default,
any previous records saved while advancing the iterator will not be correct as the underlying
pointer in memory will refer to the most recent record only, and not any past records. To preserve
the state of all previously iterated records, set the parameter persist to True.
>>> from fgpyo.fastx import FastxZipped
>>> with FastxZipped("r1.fq", "r2.fq", persist=False) as zipped:
... for (r1, r2) in zipped:
... print(f"{r1.name}: {r1.sequence}, {r2.name}: {r2.sequence}")
seq1: AAAA, seq1: CCCC
seq2: GGGG, seq2: TTTT
Classes¶
FastxZipped ¶
Bases: AbstractContextManager, Iterator[Tuple[FastxRecord, ...]]
A context manager that will lazily zip over any number of FASTA/FASTQ files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
paths
|
Union[Path, str]
|
Paths to the FASTX files to zip over. |
()
|
persist
|
bool
|
Whether to persist the state of previous records during iteration. |
False
|
Source code in fgpyo/fastx/__init__.py
Functions¶
__exit__(exc_type: Optional[Type[BaseException]], exc_val: Optional[BaseException], exc_tb: Optional[TracebackType]) -> Optional[bool]
Exit the FastxZipped context manager by closing all FASTX files.
Source code in fgpyo/fastx/__init__.py
Instantiate a FastxZipped context manager and iterator.
Source code in fgpyo/fastx/__init__.py
Return the next set of FASTX records from the zipped FASTX files.
Source code in fgpyo/fastx/__init__.py
io ¶
Module for reading and writing files¶
The functions in this module make it easy to:
- check if a file exists and is writable
- check if a file and its parent directories exist and are writable
- check if a file exists and is readable
- check if a path exists and is a directory
- open an appropriate reader or writer based on the file extension
- write items to a file, one per line
- read lines from a file
fgpyo.io Examples:¶
>>> import fgpyo.io as fio
>>> from fgpyo.io import write_lines, read_lines
>>> from pathlib import Path
Assert that a path exists and is readable:
>>> tmp_dir = Path(getfixture("tmp_path"))
>>> path_flat: Path = tmp_dir / "example.txt"
>>> fio.assert_path_is_readable(path_flat)
Traceback (most recent call last):
...
AssertionError: Cannot read non-existent path: ...
Write to and read from path:
>>> path_flat = tmp_dir / "example.txt"
>>> path_compressed = tmp_dir / "example.txt.gz"
>>> write_lines(path=path_flat, lines_to_write=["flat file", 10])
>>> write_lines(path=path_compressed, lines_to_write=["gzip file", 10])
Read lines from a path into a generator:
>>> lines = read_lines(path=path_flat)
>>> next(lines)
'flat file'
>>> next(lines)
'10'
>>> lines = read_lines(path=path_compressed)
>>> next(lines)
'gzip file'
>>> next(lines)
'10'
Functions¶
assert_directory_exists ¶
Asserts that a path exists and is a directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
Path to check |
required |
Example
assert_directory_exists(path = Path("/example/directory/"))
Source code in fgpyo/io/__init__.py
assert_fasta_indexed ¶
Verify that a FASTA is readable and has the expected index files.
The existence of the FASTA index generated by samtools faidx will always be verified. The
existence of the index files generated by samtools dict and bwa index may be optionally
verified.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
fasta
|
Path
|
Path to the FASTA file. |
required |
dictionary
|
bool
|
If True, check for the index file generated by |
False
|
bwa
|
bool
|
If True, check for the index files generated by |
False
|
Raises:
| Type | Description |
|---|---|
AssertionError
|
If the FASTA or any of the expected index files are missing or not readable. |
Source code in fgpyo/io/__init__.py
assert_path_is_readable ¶
Asserts that the path exists and is readable, else raises an AssertionError.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
a Path to check |
required |
Example
assert_path_is_readable(path=Path("some_file.csv"))
Source code in fgpyo/io/__init__.py
assert_path_is_writable ¶
Assert that a filepath is writable.
Specifically:
- If the file exists then it must also be writable.
- Else if the path is not a file and parent_must_exist is true, then assert that the parent
directory exists and is writable.
- Else if the path is not a directory and parent_must_exist is false, then look at each parent
directory until one is found that exists and is writable.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
Path to check |
required |
parent_must_exist
|
bool
|
If True, the file's parent directory must exist. Otherwise, at least one directory in the path's components must exist. |
True
|
Raises:
| Type | Description |
|---|---|
AssertionError
|
If any of the above conditions are not met. |
Example
assert_path_is_writable(path = Path("example.txt"))
Source code in fgpyo/io/__init__.py
assert_path_is_writeable ¶
A deprecated alias for assert_path_is_writable().
Source code in fgpyo/io/__init__.py
read_lines ¶
Takes a path and reads each line into a generator, removing line terminators
along the way. By default, only line terminators (CR/LF) are stripped. The strip
parameter may be used to strip both leading and trailing whitespace from each line.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
Path to read from |
required |
strip
|
bool
|
True to strip lines of all leading and trailing whitespace, False to only remove trailing CR/LF characters. |
False
|
threads
|
Optional[int]
|
the number of threads to use when decompressing gzip files |
None
|
Example
import fgpyo.io as fio
read_back = fio.read_lines(path)
Source code in fgpyo/io/__init__.py
redirect_to_dev_null ¶
A context manager that redirects the output of a file handle to /dev/null.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_num
|
int
|
The number of the file handle (file descriptor) to redirect. |
required |
Source code in fgpyo/io/__init__.py
suppress_stderr ¶
A context manager that redirects output of stderr to /dev/null
to_reader ¶
Opens a Path for reading and based on extension uses open() or gzip_ng.open()
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
Path to read from |
required |
threads
|
Optional[int]
|
the number of threads to use when decompressing gzip files |
None
|
Example
import fgpyo.io as fio
reader = fio.to_reader(path=Path("reader.txt"))
lines = reader.readlines()
reader.close()
Source code in fgpyo/io/__init__.py
to_writer ¶
Opens a Path for writing (or appending) and based on extension uses open() or gzip_ng.open()
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
Path to write (or append) to |
required |
append
|
bool
|
open the file for appending |
False
|
threads
|
Optional[int]
|
the number of threads to use when compressing gzip files |
None
|
Example
import fgpyo.io as fio
writer = fio.to_writer(path=Path("writer.txt"))
writer.write("something\n")
writer.close()
Source code in fgpyo/io/__init__.py
write_lines ¶
write_lines(path: Path, lines_to_write: Iterable[Any], append: bool = False, threads: Optional[int] = None) -> None
Writes (or appends) a file with one line per item in provided iterable
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
Path to write (or append) to |
required |
lines_to_write
|
Iterable[Any]
|
items to write (or append) to file |
required |
append
|
bool
|
open the file for appending |
False
|
threads
|
Optional[int]
|
the number of threads to use when compressing gzip files |
None
|
Example
lines: List[Any] = ["things to write", 100]
path_to_write_to: Path = Path("file_to_write_to.txt")
fio.write_lines(path=path_to_write_to, lines_to_write=lines)
Source code in fgpyo/io/__init__.py
platform ¶
Modules¶
illumina ¶
Methods for working with Illumina-specific UMIs in SAM files¶
The functions in this module make it easy to:
- check whether a UMI is valid
- extract UMI(s) from an Illumina-style read name
- copy a UMI from an alignment's read name to its
RX SAM tag
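For example, a sketch of extracting a UMI from a hypothetical Illumina-style read name (the read names below are made up):
>>> from fgpyo.platform.illumina import extract_umis_from_read_name
>>> extract_umis_from_read_name("inst:123:ABCDEXX:1:1101:1000:2000:ACGTTGCA")
'ACGTTGCA'
>>> extract_umis_from_read_name("inst:123:ABCDEXX:1:1101:1000:2000:ACGT+TTTT")
'ACGT-TTTT'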
Attributes¶
module-attribute
¶
Multiple UMI delimiter, which the SAM specification recommends should be a hyphen; see the specification here: https://samtools.github.io/hts-specs/SAMtags.pdf
Functions¶
copy_umi_from_read_name(rec: AlignedSegment, strict: bool = False, remove_umi: bool = False) -> bool
Copy a UMI from an alignment's read name to its RX SAM tag. UMI will not be copied to RX
tag if invalid.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
The alignment record to update. |
required |
strict
|
bool
|
If |
False
|
remove_umi
|
bool
|
If |
False
|
Returns:
| Type | Description |
|---|---|
bool
|
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If the read name does not end with a valid UMI. |
ValueError
|
If the record already has a populated RX SAM tag. |
Source code in fgpyo/platform/illumina.py
extract_umis_from_read_name(read_name: str, read_name_delimiter: str = _ILLUMINA_READ_NAME_DELIMITER, umi_delimiter: str = _ILLUMINA_UMI_DELIMITER, strict: bool = False) -> Optional[str]
Extract UMI(s) from an Illumina-style read name.
The UMI is expected to be the final component of the read name, delimited by the
read_name_delimiter. Multiple UMIs may be present, delimited by the umi_delimiter. This
delimiter will be replaced by the SAM-standard -.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
read_name
|
str
|
The read name to extract the UMI from. |
required |
read_name_delimiter
|
str
|
The delimiter separating the components of the read name. |
_ILLUMINA_READ_NAME_DELIMITER
|
umi_delimiter
|
str
|
The delimiter separating multiple UMIs. |
_ILLUMINA_UMI_DELIMITER
|
strict
|
bool
|
If |
False
|
Returns:
| Type | Description |
|---|---|
Optional[str]
|
The UMI extracted from the read name, or None if no UMI was found. Multiple UMIs are returned in a single string, separated by a hyphen (-). |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the read name does not end with a valid UMI. |
Source code in fgpyo/platform/illumina.py
read_structure ¶
Classes for representing Read Structures¶
A Read Structure refers to a String that describes how the bases in a sequencing run should be
allocated into logical reads. It serves a similar purpose to the --use-bases-mask in Illumina's
bcl2fastq software, but provides some additional capabilities.
A Read Structure is a sequence of <number><operator> pairs or segments where, optionally, the last
segment in the string is allowed to use + instead of a number for its length. The + translates
to whatever bases are left after the other segments are processed and can be thought of as meaning
[0..infinity].
See more at: https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures
Examples¶
>>> from fgpyo.read_structure import ReadStructure
>>> rs = ReadStructure.from_string("75T8B75T")
>>> [str(segment) for segment in rs]
['75T', '8B', '75T']
>>> rs[0]
ReadSegment(offset=0, length=75, kind=<SegmentType.Template: 'T'>)
>>> rs = rs.with_variable_last_segment()
>>> [str(segment) for segment in rs]
['75T', '8B', '+T']
>>> rs[-1]
ReadSegment(offset=83, length=None, kind=<SegmentType.Template: 'T'>)
>>> rs = ReadStructure.from_string("1B2M+T")
>>> [s.bases for s in rs.extract("A"*6)]
['A', 'AA', 'AAA']
>>> [s.bases for s in rs.extract("A"*5)]
['A', 'AA', 'AA']
>>> [s.bases for s in rs.extract("A"*4)]
['A', 'AA', 'A']
>>> [s.bases for s in rs.extract("A"*3)]
['A', 'AA', '']
>>> rs.template_segments()
(ReadSegment(offset=3, length=None, kind=<SegmentType.Template: 'T'>),)
>>> [str(segment) for segment in rs.template_segments()]
['+T']
>>> try:
... ReadStructure.from_string("23T2TT23T")
... except ValueError as ex:
... print(str(ex))
Read structure missing length information: 23T2T[T]23T
Attributes¶
ANY_LENGTH_CHAR
module-attribute
¶
A character that can be put in place of a number in a read structure to mean "0 or more bases".
Classes¶
ReadSegment ¶
Encapsulates all the information about a segment within a read structure. A segment can either have a definite length, in which case length must be an int, or an indefinite length (can be any length, 0 or more), in which case length must be None.
Attributes:
| Name | Type | Description |
|---|---|---|
offset |
int
|
The offset of the read segment in the read. |
length |
Optional[int]
|
The length of the segment, or None if it is variable length. |
kind |
SegmentType
|
The kind of read segment. |
Source code in fgpyo/read_structure.py
Attributes¶
property
¶The fixed length if there is one. Throws an exception on segments without fixed lengths!
Functions¶
extract(bases: str) -> SubReadWithoutQuals
Gets the bases associated with this read segment.
extract_with_quals(bases: str, quals: str) -> SubReadWithQuals
Gets the bases and qualities associated with this read segment.
Source code in fgpyo/read_structure.py
ReadStructure ¶
Bases: Iterable[ReadSegment]
Describes the structure of a given read. A read contains one or more read segments. A read segment describes a contiguous stretch of bases of the same type (ex. template bases) of some length and some offset from the start of the read.
Attributes:
| Name | Type | Description |
|---|---|---|
segments |
Tuple[ReadSegment, ...]
|
The segments composing the read structure |
Source code in fgpyo/read_structure.py
Attributes¶
property
¶The fixed length if there is one. Throws an exception on segments without fixed lengths!
property
¶True if the ReadStructure has a fixed (i.e. non-variable) length
property
¶Length is defined as the number of segments (not bases!) in the read structure
Functions¶
extract(bases: str) -> Tuple[SubReadWithoutQuals, ...]
Splits the given bases into tuples with its associated read segment.
extract_with_quals(bases: str, quals: str) -> Tuple[SubReadWithQuals, ...]
Splits the given bases and qualities into triples with its associated read segment.
Source code in fgpyo/read_structure.py
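For example, a sketch assuming the returned sub-reads expose bases and quals attributes (per SubReadWithQuals below):
>>> from fgpyo.read_structure import ReadStructure
>>> rs = ReadStructure.from_string("2B+T")
>>> [(sub.bases, sub.quals) for sub in rs.extract_with_quals("ACGTT", "IIIFF")]
[('AC', 'II'), ('GTT', 'IFF')]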
classmethod
¶
from_segments(segments: Tuple[ReadSegment, ...], reset_offsets: bool = False) -> ReadStructure
Creates a new ReadStructure, optionally resetting the offsets on each of the segments
Source code in fgpyo/read_structure.py
segments_by_kind(kind: SegmentType) -> Tuple[ReadSegment, ...]
with_variable_last_segment() -> ReadStructure
Generates a new ReadStructure that is the same as this one except that the last segment has undefined length
Source code in fgpyo/read_structure.py
SegmentType ¶
Bases: Enum
The type of segments that can show up in a read structure
Source code in fgpyo/read_structure.py
Attributes¶
class-attribute
instance-attribute
¶The segment type for cell barcode bases.
class-attribute
instance-attribute
¶The segment type for molecular barcode bases.
class-attribute
instance-attribute
¶The segment type for sample barcode bases.
class-attribute
instance-attribute
¶The segment type for bases that need to be skipped.
SubReadWithQuals ¶
Contains the bases and qualities that correspond to the given read segment
Source code in fgpyo/read_structure.py
Attributes¶
instance-attribute
¶The sub-read base qualities that correspond to the given read segment.
instance-attribute
¶
segment: ReadSegment
The segment of the read structure that describes this sub-read.
SubReadWithoutQuals ¶
Contains the bases that correspond to the given read segment.
Source code in fgpyo/read_structure.py
Attributes¶
instance-attribute
¶
segment: ReadSegment
The segment of the read structure that describes this sub-read.
sam ¶
Utility Classes and Methods for SAM/BAM¶
This module contains utility classes for working with SAM/BAM files and the data contained within them. This includes i) utilities for opening SAM/BAM files for reading and writing, ii) functions for manipulating supplementary alignments, iii) classes and functions for manipulating CIGAR strings, and iv) a class for building sam records and files for testing.
Motivation for Reader and Writer methods¶
The following are the reasons for choosing to implement methods to open a SAM/BAM file for
reading and writing, rather than relying on pysam.AlignmentFile directly:
- Provides a centralized place for the implementation of opening a SAM/BAM for reading and writing. This is useful if any additional parameters are added, or changes to standards or defaults are made.
- Makes the requirement to provide a header when opening a file for writing more explicit.
- Adds support for pathlib.Path.
- Removes the reliance on specifying the mode correctly, including specifying the file type (i.e. SAM, BAM, or CRAM), as well as additional options (ex. compression level). This makes the code more explicit and easier to read.
- An explicit check is performed to ensure the file type is specified when writing using a file-like object rather than a path to a file.
Examples of Opening a SAM/BAM for Reading or Writing¶
Opening a SAM/BAM file for reading, auto-recognizing the file-type by the file extension. See
SamFileType() for the supported file types.
>>> from fgpyo.sam import reader
>>> with reader("/path/to/sample.sam") as fh:
... for record in fh:
... print(record.query_name) # do something
>>> with reader("/path/to/sample.bam") as fh:
... for record in fh:
... print(record.query_name) # do something
Opening a SAM/BAM file for reading, explicitly passing the file type.
>>> from fgpyo.sam import SamFileType
>>> with reader(path="/path/to/sample.ext1", file_type=SamFileType.SAM) as fh:
... for record in fh:
... print(record.query_name) # do something
>>> with reader(path="/path/to/sample.ext2", file_type=SamFileType.BAM) as fh:
... for record in fh:
... print(record.query_name) # do something
Opening a SAM/BAM file for reading, using an existing file-like object
>>> with open("/path/to/sample.sam", "rb") as file_object:
... with reader(path=file_object, file_type=SamFileType.BAM) as fh:
... for record in fh:
... print(record.query_name) # do something
Opening a SAM/BAM file for writing follows similar to the reader()
method, but the SAM file header object is required.
>>> from fgpyo.sam import writer
>>> from typing import Any, Dict
>>> header: Dict[str, Any] = {
... "HD": {"VN": "1.5", "SO": "coordinate"},
... "RG": [{"ID": "1", "SM": "1_AAAAAA", "LB": "lib", "PL": "ILLUMINA", "PU": "xxx.1"}],
... "SQ": [
... {"SN": "chr1", "LN": 249250621},
... {"SN": "chr2", "LN": 243199373}
... ]
... }
>>> with writer(path="/path/to/sample.bam", header=header) as fh:
... pass # do something
Examples of Manipulating Cigars¶
Creating a Cigar from a pysam.AlignedSegment.
>>> from fgpyo.sam import Cigar
>>> with reader("/path/to/sample.sam") as fh:
... record = next(fh)
... cigar = Cigar.from_cigartuples(record.cigartuples)
... print(str(cigar))
50M2D5M10S
Creating a Cigar from a str().
If the cigar string is invalid, the exception message will show you the problem character(s) in square brackets.
>>> cigar = Cigar.from_cigarstring("10M5U")
Traceback (most recent call last):
...
fgpyo.sam.CigarParsingException: Malformed cigar: 10M5[U]
The cigar contains a tuple of CigarElement()s. Each element
contains the cigar operator (CigarOp()) and associated operator
length. A number of useful methods are part of both classes.
The number of bases aligned on the query (i.e. the number of bases consumed by the cigar from the query):
>>> cigar = Cigar.from_cigarstring("50M2D5M2I10S")
>>> [e.length_on_query for e in cigar.elements]
[50, 0, 5, 2, 10]
>>> [e.length_on_target for e in cigar.elements]
[50, 2, 5, 0, 0]
>>> [e.operator.is_indel for e in cigar.elements]
[False, True, False, True, False]
Any particular element can be accessed directly via .elements with its index (and works with
negative indexes and slices):
>>> cigar = Cigar.from_cigarstring("50M2D5M2I10S")
>>> cigar.elements[0].length
50
>>> cigar.elements[1].operator
<CigarOp.D: (2, 'D', False, True)>
>>> cigar.elements[-1].operator
<CigarOp.S: (4, 'S', True, False)>
>>> tuple(x.operator.character for x in cigar.elements[1:3])
('D', 'M')
>>> tuple(x.operator.character for x in cigar.elements[-2:])
('I', 'S')
Examples of parsing the SA tag and individual supplementary alignments¶
>>> from fgpyo.sam import SupplementaryAlignment
>>> sup = SupplementaryAlignment.parse("chr1,123,+,50S100M,60,0")
>>> sup.reference_name
'chr1'
>>> sup.nm
0
>>> from typing import List
>>> sa_tag = "chr1,123,+,50S100M,60,0;chr2,456,-,75S75M,60,1"
>>> sups: List[SupplementaryAlignment] = SupplementaryAlignment.parse_sa_tag(tag=sa_tag)
>>> len(sups)
2
>>> [str(sup.cigar) for sup in sups]
['50S100M', '75S75M']
Attributes¶
DefaultProperlyPairedOrientations
module-attribute
¶
DefaultProperlyPairedOrientations: set[PairOrientation] = {FR}
The default orientations for properly paired reads.
NO_QUERY_QUALITIES
module-attribute
¶
NO_QUERY_QUALITIES: array = qualitystring_to_array(STRING_PLACEHOLDER)
The quality array corresponding to an unavailable query quality string ("*").
NO_REF_INDEX
module-attribute
¶
The reference index to use to indicate no reference in SAM/BAM.
NO_REF_NAME
module-attribute
¶
NO_REF_NAME: str = STRING_PLACEHOLDER
The reference name to use to indicate no reference in SAM/BAM.
NO_REF_POS
module-attribute
¶
The reference position to use to indicate no position in SAM/BAM.
STRING_PLACEHOLDER
module-attribute
¶
The value to use when a string field's information is unavailable.
SamPath
module-attribute
¶
The valid base classes for opening a SAM/BAM/CRAM file.
Classes¶
Cigar ¶
Class representing a cigar string.
Attributes:
| Name | Type | Description |
|---|---|---|
elements |
Tuple[CigarElement, ...]
|
zero or more cigar elements |
Source code in fgpyo/sam/__init__.py
Functions¶
classmethod
¶
from_cigarstring(cigarstring: str) -> Cigar
Constructs a Cigar from a string returned by pysam.
If "*" is given, returns an empty Cigar.
Source code in fgpyo/sam/__init__.py
classmethod
¶
from_cigartuples(cigartuples: Optional[List[Tuple[int, int]]]) -> Cigar
Returns a Cigar from a list of tuples returned by pysam.
Each tuple denotes the operation and length. See
CigarOp() for more information on the
various operators. If None is given, returns an empty Cigar.
Source code in fgpyo/sam/__init__.py
query_alignment_offsets() -> Tuple[int, int]
Gets the 0-based, end-exclusive positions of the first and last aligned base in the query.
The resulting range will contain the range of positions in the SEQ string for
the bases that are aligned.
If counting from the end of the query is desired, use
cigar.reversed().query_alignment_offsets()
Returns:
| Type | Description |
|---|---|
Tuple[int, int]
|
A tuple (start, stop) containing the start and stop positions of the aligned part of the query. These offsets are 0-based and open-ended, with respect to the beginning of the query. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If according to the cigar, there are no aligned query bases. |
Source code in fgpyo/sam/__init__.py
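For example, a short sketch (offsets are 0-based and end-exclusive, so the 2S4M3S cigar below yields the aligned range [2, 6)):
>>> from fgpyo.sam import Cigar
>>> Cigar.from_cigarstring("2S4M3S").query_alignment_offsets()
(2, 6)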
CigarElement ¶
Represents an element in a Cigar
Attributes:
| Name | Type | Description |
|---|---|---|
length |
int
|
the length of the element |
operator |
CigarOp
|
the operator of the element |
Source code in fgpyo/sam/__init__.py
CigarOp ¶
Bases: Enum
Enumeration of operators that can appear in a Cigar string.
Attributes:
| Name | Type | Description |
|---|---|---|
code |
int
|
The integer code of the operator. |
character |
str
|
The single character cigar operator. |
consumes_query |
bool
|
True if this operator consumes query bases, False otherwise. |
consumes_target |
bool
|
True if this operator consumes target bases, False otherwise. |
Source code in fgpyo/sam/__init__.py
Attributes¶
property
¶Returns true if the operator is a soft/hard clip, false otherwise.
Functions¶
staticmethod
¶
from_character(character: str) -> CigarOp
Returns the operator from the single character.
CigarParsingException ¶
PairOrientation ¶
Bases: Enum
Enumerations of read pair orientations.
Source code in fgpyo/sam/__init__.py
Attributes¶
class-attribute
instance-attribute
¶A pair orientation for forward-reverse reads ("innie").
class-attribute
instance-attribute
¶A pair orientation for reverse-forward reads ("outie").
class-attribute
instance-attribute
¶A pair orientation for tandem (forward-forward or reverse-reverse) reads.
Functions¶
classmethod
¶
from_recs(rec1: AlignedSegment, rec2: Optional[AlignedSegment] = None) -> Optional[PairOrientation]
Returns the pair orientation if both reads are mapped to the same reference sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec1
|
AlignedSegment
|
The first record in the pair. |
required |
rec2
|
Optional[AlignedSegment]
|
The second record in the pair. If None, then mate info on rec1 will be used. |
None
|
Source code in fgpyo/sam/__init__.py
ReadEditInfo ¶
Counts various stats about how a read compares to a reference sequence.
Attributes:
| Name | Type | Description |
|---|---|---|
matches |
int
|
the number of bases in the read that match the reference |
mismatches |
int
|
the number of mismatches between the read sequence and the reference sequence as dictated by the alignment. As defined for the SAM NM tag computation, any base except A/C/G/T in the read is considered a mismatch. |
insertions |
int
|
the number of insertions in the read vs. the reference. I.e. the number of I operators in the CIGAR string. |
inserted_bases |
int
|
the total number of bases contained within insertions in the read |
deletions |
int
|
the number of deletions in the read vs. the reference. I.e. the number of D operators in the CIGAR string. |
deleted_bases |
int
|
the total number of bases that are deleted within the alignment (i.e. bases in the reference but not in the read). |
nm |
int
|
the computed value of the SAM NM tag, calculated as mismatches + inserted_bases + deleted_bases |
Source code in fgpyo/sam/__init__.py
SamFileType ¶
Bases: Enum
Enumeration of valid SAM/BAM/CRAM file types.
Attributes:
| Name | Type | Description |
|---|---|---|
mode |
str
|
The additional mode character to add when opening this file type. |
ext |
str
|
The standard file extension for this file type. |
Source code in fgpyo/sam/__init__.py
Attributes¶
Functions¶
classmethod
¶
from_path(path: Union[Path, str]) -> SamFileType
Infers the file type based on the file extension.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Union[Path, str]
|
the path to the SAM/BAM/CRAM to read or write. |
required |
Source code in fgpyo/sam/__init__.py
SamOrder ¶
Bases: Enum
Enumerations of possible sort orders for a SAM file.
Source code in fgpyo/sam/__init__.py
SupplementaryAlignment ¶
Stores a supplementary alignment record produced by BWA and stored in the SA SAM tag.
Attributes:
| Name | Type | Description |
|---|---|---|
reference_name |
str
|
the name of the reference (i.e. contig, chromosome) aligned to |
start |
int
|
the 0-based start position of the alignment |
is_forward |
bool
|
true if the alignment is in the forward strand, false otherwise |
cigar |
Cigar
|
the cigar for the alignment |
mapq |
int
|
the mapping quality |
nm |
int
|
the number of edits |
Source code in fgpyo/sam/__init__.py
Attributes¶
Functions¶
classmethod
¶
from_read(read: AlignedSegment) -> List[SupplementaryAlignment]
Construct a list of SupplementaryAlignments from the SA tag in a pysam.AlignedSegment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
read
|
AlignedSegment
|
An alignment. The presence of the "SA" tag is not required. |
required |
Returns:
| Type | Description |
|---|---|
List[SupplementaryAlignment]
|
A list of all SupplementaryAlignments present in the SA tag. |
List[SupplementaryAlignment]
|
If the SA tag is not present, or it is empty, an empty list will be returned. |
Source code in fgpyo/sam/__init__.py
staticmethod
¶
parse(string: str) -> SupplementaryAlignment
Returns a supplementary alignment parsed from the given string. The various fields
should be comma-delimited (ex. chr1,123,-,100M50S,60,4)
Source code in fgpyo/sam/__init__.py
staticmethod
¶
parse_sa_tag(tag: str) -> List[SupplementaryAlignment]
Parses an SA tag of supplementary alignments from a BAM file. If the tag is empty or contains just a single semi-colon then an empty list will be returned. Otherwise a list containing a SupplementaryAlignment per ;-separated value in the tag will be returned.
Source code in fgpyo/sam/__init__.py
Template ¶
A container for alignment records corresponding to a single sequenced template or insert.
It is strongly preferred that new Template instances be created with Template.build()
which will ensure that reads are stored in the correct Template property, and run basic
validations of the Template by default. If constructing Template instances directly,
users are encouraged to call the validate method post-construction.
In the special case where alignment records are both secondary and supplementary,
they will be stored in the r1_supplementals and r2_supplementals fields only.
Attributes:
| Name | Type | Description |
|---|---|---|
name |
str
|
the name of the template/query |
r1 |
Optional[AlignedSegment]
|
Primary non-supplementary alignment for read 1, or None if there is none |
r2 |
Optional[AlignedSegment]
|
Primary non-supplementary alignment for read 2, or None if there is none |
r1_supplementals |
List[AlignedSegment]
|
Supplementary alignments for read 1 |
r2_supplementals |
List[AlignedSegment]
|
Supplementary alignments for read 2 |
r1_secondaries |
List[AlignedSegment]
|
Secondary (non-primary, non-supplementary) alignments for read 1 |
r2_secondaries |
List[AlignedSegment]
|
Secondary (non-primary, non-supplementary) alignments for read 2 |
Source code in fgpyo/sam/__init__.py
Functions¶
Yields all R1 alignments of this template including secondary and supplementary.
Source code in fgpyo/sam/__init__.py
Yields all R2 alignments of this template including secondary and supplementary.
Source code in fgpyo/sam/__init__.py
Returns a list with all the records for the template.
Source code in fgpyo/sam/__init__.py
staticmethod
¶
build(recs: Iterable[AlignedSegment], validate: bool = True) -> Template
Build a template from a set of records all with the same queryname.
Source code in fgpyo/sam/__init__.py
staticmethod
¶
iterator(alns: Iterator[AlignedSegment]) -> Iterator[Template]
Returns an iterator over templates. Assumes the input iterable is queryname grouped, and gathers consecutive runs of records sharing a common query name into templates.
Source code in fgpyo/sam/__init__.py
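For example, a minimal sketch assuming a queryname-grouped input file (the path below is hypothetical):
>>> from fgpyo.sam import Template, reader
>>> with reader("/path/to/grouped.bam") as fh:
...     for template in Template.iterator(fh):
...         print(template.name, template.r1 is not None, template.r2 is not None)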
set_mate_info(is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair, isize: Callable[[AlignedSegment, AlignedSegment], int] = isize) -> Self
Reset all mate information on every alignment in the template.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
is_proper_pair
|
Callable[[AlignedSegment, AlignedSegment], bool]
|
A function that takes two alignments and determines proper pair status. |
is_proper_pair
|
isize
|
Callable[[AlignedSegment, AlignedSegment], int]
|
A function that takes the two alignments and calculates their isize. |
isize
|
Source code in fgpyo/sam/__init__.py
Add a tag to all records associated with the template.
Setting a tag to None will remove the tag.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
tag
|
str
|
The name of the tag. |
required |
value
|
Union[str, int, float, None]
|
The value of the tag. |
required |
Source code in fgpyo/sam/__init__.py
Performs sanity checks that all the records in the Template are as expected.
Source code in fgpyo/sam/__init__.py
Write the records associated with the template to file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
writer
|
AlignmentFile
|
An open, writable AlignmentFile. |
required |
primary_only
|
bool
|
If True, only write primary alignments. |
False
|
Source code in fgpyo/sam/__init__.py
TemplateIterator ¶
Bases: Iterator[Template]
An iterator that converts an iterator over query-grouped reads into an iterator over templates.
Source code in fgpyo/sam/__init__.py
Functions¶
calculate_edit_info ¶
calculate_edit_info(rec: AlignedSegment, reference_sequence: str, reference_offset: Optional[int] = None) -> ReadEditInfo
Constructs a ReadEditInfo instance giving summary stats about how the read aligns to the
reference. Computes the number of mismatches, indels, indel bases and the SAM NM tag.
The read must be aligned.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
the read/record for which to calculate values |
required |
reference_sequence
|
str
|
the reference sequence (or fragment thereof) that the read is aligned to |
required |
reference_offset
|
Optional[int]
|
if provided, assume that reference_sequence[reference_offset] is the first base aligned to in reference_sequence, otherwise use r.reference_start |
None
|
Returns:
| Type | Description |
|---|---|
ReadEditInfo
|
a ReadEditInfo with information about how the read differs from the reference |
Source code in fgpyo/sam/__init__.py
is_proper_pair ¶
is_proper_pair(rec1: AlignedSegment, rec2: Optional[AlignedSegment] = None, max_insert_size: int = 1000, orientations: Collection[PairOrientation] = DefaultProperlyPairedOrientations, isize: Callable[[AlignedSegment, AlignedSegment], int] = isize) -> bool
Determines if a pair of records are properly paired or not.
Criteria for records in a proper pair are
- Both records are aligned
- Both records are aligned to the same reference sequence
- The pair orientation of the records is one of the valid pair orientations (default "FR")
- The inferred insert size is not more than a maximum length (default 1000)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec1
|
AlignedSegment
|
The first record in the pair. |
required |
rec2
|
Optional[AlignedSegment]
|
The second record in the pair. If None, then mate info on rec1 will be used. |
None
|
max_insert_size
|
int
|
The maximum insert size to consider a pair "proper". |
1000
|
orientations
|
Collection[PairOrientation]
|
The valid set of orientations to consider a pair "proper". |
DefaultProperlyPairedOrientations
|
isize
|
Callable[[AlignedSegment, AlignedSegment], int]
|
A function that takes the two alignments and calculates their isize. |
isize
|
Source code in fgpyo/sam/__init__.py
isize ¶
Computes the insert size ("template length" or "TLEN") for a pair of records.
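A hedged sketch of its use (records are built with the SamBuilder test utility documented below; read lengths and positions are illustrative):
>>> from fgpyo.sam import isize
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder(r1_len=100, r2_len=100)
>>> rec1, rec2 = builder.add_pair(chrom="chr1", start1=100, start2=300)
>>> tlen = isize(rec1, rec2)  # the inferred template length (TLEN) for the pair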
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec1
|
AlignedSegment
|
The first record in the pair. |
required |
rec2
|
Optional[AlignedSegment]
|
The second record in the pair. If None, then mate info on rec1 is used. |
None
|
Source code in fgpyo/sam/__init__.py
reader ¶
reader(path: SamPath, file_type: Optional[SamFileType] = None, unmapped: bool = False) -> AlignmentFile
Opens a SAM/BAM/CRAM for reading.
To read from standard input, provide any of "-", "stdin", or "/dev/stdin" as the input
path.
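For example (the test SAM path below is the same file used in the clipping examples later on this page):
>>> from fgpyo.sam import reader
>>> with reader("./tests/fgpyo/sam/data/valid.sam") as fh:
...     for rec in fh:
...         ...  # each rec is a pysam.AlignedSegment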
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
SamPath
|
a file handle or path to the SAM/BAM/CRAM to read or write. |
required |
file_type
|
Optional[SamFileType]
|
the file type to assume when opening the file. If None, then the file type will be auto-detected. |
None
|
unmapped
|
bool
|
True if the file is unmapped and has no sequence dictionary, False otherwise. |
False
|
Source code in fgpyo/sam/__init__.py
set_mate_info ¶
set_mate_info(rec1: AlignedSegment, rec2: AlignedSegment, is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair, isize: Callable[[AlignedSegment, AlignedSegment], int] = isize) -> None
Resets mate pair information between two primary alignments that share a query name.
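A hedged sketch (records built with the SamBuilder test utility documented below; moving the start position is only meant to illustrate stale mate info being refreshed):
>>> from fgpyo.sam import set_mate_info
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()
>>> rec1, rec2 = builder.add_pair(chrom="chr1", start1=100, start2=300)
>>> rec1.reference_start = 150  # rec2's mate fields are now stale
>>> set_mate_info(rec1, rec2)   # re-syncs mate coordinates, flags, and the template length
>>> rec2.next_reference_start
150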
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec1
|
AlignedSegment
|
The first record in the pair. |
required |
rec2
|
AlignedSegment
|
The second record in the pair. |
required |
is_proper_pair
|
Callable[[AlignedSegment, AlignedSegment], bool]
|
A function that takes the two alignments and determines proper pair status. |
is_proper_pair
|
isize
|
Callable[[AlignedSegment, AlignedSegment], int]
|
A function that takes the two alignments and calculates their isize. |
isize
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If rec1 and rec2 are of the same read ordinal. |
ValueError
|
If either rec1 or rec2 is secondary or supplementary. |
ValueError
|
If rec1 and rec2 do not share the same query name. |
Source code in fgpyo/sam/__init__.py
set_mate_info_on_secondary ¶
Set mate info on a secondary alignment from its mate's primary alignment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
secondary
|
AlignedSegment
|
The secondary alignment to set mate information upon. |
required |
mate_primary
|
AlignedSegment
|
The primary alignment of the secondary's mate. |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If secondary and mate_primary are of the same read ordinal. |
ValueError
|
If secondary and mate_primary do not share the same query name. |
ValueError
|
If mate_primary is secondary or supplementary. |
ValueError
|
If secondary is not marked as a secondary alignment. |
Source code in fgpyo/sam/__init__.py
set_mate_info_on_supplementary ¶
Set mate info on a supplementary alignment from its mate's primary alignment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
supp
|
AlignedSegment
|
The supplementary alignment to set mate information upon. |
required |
mate_primary
|
AlignedSegment
|
The primary alignment of the supplementary's mate. |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If supp and mate_primary are of the same read ordinal. |
ValueError
|
If supp and mate_primary do not share the same query name. |
ValueError
|
If mate_primary is secondary or supplementary. |
ValueError
|
If supp is not marked as a supplementary alignment. |
Source code in fgpyo/sam/__init__.py
set_pair_info ¶
Resets mate pair information between reads in a pair.
Can be handed reads that already have pairing flags setup or independent R1 and R2 records that are currently flagged as SE reads.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
r1
|
AlignedSegment
|
Read 1 (first read in the template). |
required |
r2
|
AlignedSegment
|
Read 2 with the same query name as r1 (second read in the template). |
required |
proper_pair
|
bool
|
whether the pair is proper or not. |
True
|
Source code in fgpyo/sam/__init__.py
sum_of_base_qualities ¶
Calculate the sum of base qualities score for an alignment record.
This function is useful for calculating the "mate score" as implemented in samtools fixmate.
Consistent with samtools fixmate, this function returns 0 if the record has no base qualities.
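For example (a sketch using the SamBuilder test utility documented below; ten bases at the builder's default quality of 30 sum to 300):
>>> from fgpyo.sam import sum_of_base_qualities
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder(base_quality=30)
>>> rec = builder.add_single(bases="ACGTACGTAC")
>>> sum_of_base_qualities(rec)
300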
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
The alignment record to calculate the sum of base qualities from. |
required |
min_quality_score
|
int
|
The minimum base quality score to use for summation. |
15
|
Returns:
| Type | Description |
|---|---|
int
|
The sum of base qualities on the input record. 0 if the record has no base qualities. |
Source code in fgpyo/sam/__init__.py
writer ¶
writer(path: SamPath, header: Union[str, Dict[str, Any], AlignmentHeader], file_type: Optional[SamFileType] = None) -> AlignmentFile
Opens a SAM/BAM/CRAM for writing.
To write to standard output, provide any of "-", "stdout", or "/dev/stdout" as the output
path. Note: When writing to stdout, the file_type must be given.
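A hedged sketch of copying records from one file to another (the file names "in.bam" and "out.bam" are illustrative):
>>> from fgpyo import sam
>>> with sam.reader("in.bam") as in_bam:
...     with sam.writer("out.bam", header=in_bam.header) as out_bam:
...         for rec in in_bam:
...             out_bam.write(rec)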
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
SamPath
|
a file handle or path to the SAM/BAM/CRAM to read or write. |
required |
header
|
Union[str, Dict[str, Any], AlignmentHeader]
|
Either a string to use for the header or a multi-level dictionary. The multi-level dictionary should be given as follows. The first level are the four types (‘HD’, ‘SQ’, ...). The second level are a list of lines, with each line being a list of tag-value pairs. The header is constructed first from all the defined fields, followed by user tags in alphabetical order. |
required |
file_type
|
Optional[SamFileType]
|
the file type to assume when opening the file. If None, the file type will be auto-detected from the path; it must be given explicitly when writing to standard output. |
None
|
Source code in fgpyo/sam/__init__.py
Modules¶
builder ¶
Classes for generating SAM and BAM files and records for testing¶
This module contains utility classes for the generation of SAM and BAM files and alignment records, for use in testing.
Classes¶
Builder for constructing one or more sam records (AlignedSegments in pysam terms).
Provides the ability to manufacture records from minimal arguments, while generating any remaining attributes to ensure a valid record.
A builder is constructed with a handful of defaults including lengths for generated R1s and R2s, the default base quality score to use, a sequence dictionary and a single read group.
Records are then added using the add_pair()
method. Once accumulated the records can be accessed in the order in which they were created
through the to_unsorted_list()
function, or in a list sorted by coordinate order via
to_sorted_list(). The latter creates
a temporary file to do the sorting and is somewhat slower as a result. Lastly, the records can
be written to a temporary file using
to_path().
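A brief sketch of the typical flow (contig names assume the default sequence dictionary):
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()
>>> r1, r2 = builder.add_pair(chrom="chr1", start1=100, start2=300)
>>> records = builder.to_unsorted_list()  # records in the order they were added
>>> path = builder.to_path()              # writes a sorted, indexed temporary BAM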
Source code in fgpyo/sam/builder.py
__init__(r1_len: Optional[int] = None, r2_len: Optional[int] = None, base_quality: int = 30, mapping_quality: int = 60, sd: Optional[List[Dict[str, Any]]] = None, rg: Optional[Dict[str, str]] = None, extra_header: Optional[Dict[str, Any]] = None, seed: int = 42, sort_order: SamOrder = Coordinate) -> None
Initializes a new SamBuilder for generating alignment records and SAM/BAM files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
r1_len
|
Optional[int]
|
The length of R1s to create unless otherwise specified |
None
|
r2_len
|
Optional[int]
|
The length of R2s to create unless otherwise specified |
None
|
base_quality
|
int
|
The base quality of bases to create unless otherwise specified |
30
|
sd
|
Optional[List[Dict[str, Any]]]
|
a sequence dictionary as a list of dicts; defaults to calling default_sd() if None |
None
|
rg
|
Optional[Dict[str, str]]
|
a single read group as a dict; defaults to calling default_rg() if None |
None
|
extra_header
|
Optional[Dict[str, Any]]
|
a dictionary of extra values to add to the header, None otherwise. See
|
None
|
seed
|
int
|
a seed value for random number/string generation |
42
|
sort_order
|
SamOrder
|
Order to sort records when writing to file, or output of to_sorted_list() |
Coordinate
|
Source code in fgpyo/sam/builder.py
add_pair(*, name: Optional[str] = None, bases1: Optional[str] = None, bases2: Optional[str] = None, quals1: Optional[List[int]] = None, quals2: Optional[List[int]] = None, chrom: Optional[str] = None, chrom1: Optional[str] = None, chrom2: Optional[str] = None, start1: int = NO_REF_POS, start2: int = NO_REF_POS, cigar1: Optional[str] = None, cigar2: Optional[str] = None, mapq1: Optional[int] = None, mapq2: Optional[int] = None, strand1: str = '+', strand2: str = '-', attrs: Optional[Dict[str, Any]] = None) -> Tuple[AlignedSegment, AlignedSegment]
Generates a new pair of reads, adds them to the internal collection, and returns them.
Most fields are optional.
Mapped pairs can be created by specifying both start1 and start2 and either chrom, for
pairs where both reads map to the same contig, or both chrom1 and chrom2, for pairs
where reads map to different contigs. i.e.:
- `add_pair(chrom, start1, start2)` will create a mapped pair where both reads map to
the same contig (`chrom`).
- `add_pair(chrom1, start1, chrom2, start2)` will create a mapped pair where the reads
map to different contigs (`chrom1` and `chrom2`).
A pair with only one of the two reads mapped can be created by setting only one start position. Flags will automatically be set correctly for the unmapped mate.
- `add_pair(chrom, start1)`
- `add_pair(chrom1, start1)`
- `add_pair(chrom, start2)`
- `add_pair(chrom2, start2)`
An unmapped pair can be created by calling the method with no parameters (specifically,
not setting chrom, chrom1, start1, chrom2, or start2). If either cigar is
provided, it will be ignored.
For a given read (i.e. R1 or R2) the length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.
When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.
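For example (contig names assume the default sequence dictionary; the starts are illustrative):
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()
>>> r1, r2 = builder.add_pair(chrom="chr1", start1=100, start2=250)                  # both mapped to chr1
>>> r1, r2 = builder.add_pair(chrom1="chr1", start1=100, chrom2="chr2", start2=500)  # mates on different contigs
>>> r1, r2 = builder.add_pair(chrom="chr1", start1=100)                              # R2 unmapped
>>> r1, r2 = builder.add_pair()                                                      # both unmapped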
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
Optional[str]
|
The name of the template. If None is given a unique name will be auto-generated. |
None
|
bases1
|
Optional[str]
|
The bases for R1. If None is given a random sequence is generated. |
None
|
bases2
|
Optional[str]
|
The bases for R2. If None is given a random sequence is generated. |
None
|
quals1
|
Optional[List[int]]
|
The list of int qualities for R1. If None, the default base quality is used. |
None
|
quals2
|
Optional[List[int]]
|
The list of int qualities for R2. If None, the default base quality is used. |
None
|
chrom
|
Optional[str]
|
The chromosome to which both reads are mapped. Defaults to the unmapped value. |
None
|
chrom1
|
Optional[str]
|
The chromosome to which R1 is mapped. If None, chrom is used. |
None
|
chrom2
|
Optional[str]
|
The chromosome to which R2 is mapped. If None, chrom is used. |
None
|
start1
|
int
|
The start position of R1. Defaults to the unmapped value. |
NO_REF_POS
|
start2
|
int
|
The start position of R2. Defaults to the unmapped value. |
NO_REF_POS
|
cigar1
|
Optional[str]
|
The cigar string for R1. Defaults to None for unmapped reads, otherwise all M. |
None
|
cigar2
|
Optional[str]
|
The cigar string for R2. Defaults to None for unmapped reads, otherwise all M. |
None
|
mapq1
|
Optional[int]
|
Mapping quality for R1. Defaults to self.mapping_quality if None. |
None
|
mapq2
|
Optional[int]
|
Mapping quality for R2. Defaults to self.mapping_quality if None. |
None
|
strand1
|
str
|
The strand for R1, either "+" or "-". Defaults to "+". |
'+'
|
strand2
|
str
|
The strand for R2, either "+" or "-". Defaults to "-". |
'-'
|
attrs
|
Optional[Dict[str, Any]]
|
An optional dictionary of SAM attributes to place on both R1 and R2. |
None
|
Raises:
| Type | Description |
|---|---|
ValueError
|
if either strand field is not "+" or "-" |
ValueError
|
if bases/quals/cigar are set in a way that is not self-consistent |
Returns:
| Type | Description |
|---|---|
Tuple[AlignedSegment, AlignedSegment]
|
Tuple[AlignedSegment, AlignedSegment]: The pair of records created, R1 then R2. |
Source code in fgpyo/sam/builder.py
add_single(*, name: Optional[str] = None, read_num: Optional[int] = None, bases: Optional[str] = None, quals: Optional[List[int]] = None, chrom: str = NO_REF_NAME, start: int = NO_REF_POS, cigar: Optional[str] = None, mapq: Optional[int] = None, strand: str = '+', secondary: bool = False, supplementary: bool = False, attrs: Optional[Dict[str, Any]] = None) -> AlignedSegment
Generates a new single read, adds it to the internal collection, and returns it.
Most fields are optional.
If read_num is None (the default) an unpaired read will be created. If read_num is
set to 1 or 2, the read will have its paired flag and read number flags set.
An unmapped read can be created by calling the method with no parameters (specifically, not setting chrom or start). If cigar is provided, it will be ignored.
A mapped read is created by providing chrom and start.
The length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.
When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.
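For example (contig names assume the default sequence dictionary; the starts are illustrative):
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()
>>> frag = builder.add_single(chrom="chr1", start=500)                 # mapped, unpaired read
>>> sec = builder.add_single(chrom="chr1", start=700, secondary=True)  # secondary alignment
>>> unmapped = builder.add_single()                                    # unmapped, unpaired read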
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
Optional[str]
|
The name of the template. If None is given a unique name will be auto-generated. |
None
|
read_num
|
Optional[int]
|
Either None, 1 for R1 or 2 for R2 |
None
|
bases
|
Optional[str]
|
The bases for the read. If None is given a random sequence is generated. |
None
|
quals
|
Optional[List[int]]
|
The list of qualities for the read. If None, the default base quality is used. |
None
|
chrom
|
str
|
The chromosome to which the read is mapped. Defaults to the unmapped value. |
NO_REF_NAME
|
start
|
int
|
The start position of the read. Defaults to the unmapped value. |
NO_REF_POS
|
cigar
|
Optional[str]
|
The cigar string for the read. Defaults to None for unmapped reads, otherwise all M. |
None
|
mapq
|
Optional[int]
|
Mapping quality for the read. Default to self.mapping_quality if not given. |
None
|
strand
|
str
|
The strand for the read, either "+" or "-". Defaults to "+". |
'+'
|
secondary
|
bool
|
If true the read will be flagged as secondary |
False
|
supplementary
|
bool
|
If true the read will be flagged as supplementary |
False
|
attrs
|
Optional[Dict[str, Any]]
|
An optional dictionary of SAM attributes to place on the read. |
None
|
Raises:
| Type | Description |
|---|---|
ValueError
|
if strand field is not "+" or "-" |
ValueError
|
if read_num is not None, 1 or 2 |
ValueError
|
if bases/quals/cigar are set in a way that is not self-consistent |
Returns:
| Name | Type | Description |
|---|---|---|
AlignedSegment |
AlignedSegment
|
The record created |
Source code in fgpyo/sam/builder.py
staticmethod
¶Returns the default read group used by the SamBuilder, as a dictionary.
staticmethod
¶Generates the sequence dictionary that is used by default by SamBuilder.
Matches the names and lengths of the HG19 reference in use in production.
Returns:
| Type | Description |
|---|---|
List[Dict[str, Any]]
|
A new copy of the sequence dictionary as a list of dictionaries, one per chromosome. |
Source code in fgpyo/sam/builder.py
Returns the single read group that is defined in the header.
Source code in fgpyo/sam/builder.py
Returns the ID of the single read group that is defined in the header.
Source code in fgpyo/sam/builder.py
to_path(path: Optional[Path] = None, index: bool = True, pred: Callable[[AlignedSegment], bool] = lambda r: True, tmp_file_type: Optional[SamFileType] = None) -> Path
Writes the accumulated records to a file, sorts and indexes it, and returns the Path. If a path is provided, it will be written to, otherwise a temporary file is created and returned.
If path is provided, tmp_file_type may not be provided; the file type (SAM/BAM/CRAM) will be determined automatically from the file extension. See pysam for more details.
If path is not provided, the file type will default to BAM unless tmp_file_type is
provided.
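For example (the explicit output path below is illustrative):
>>> from pathlib import Path
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()
>>> _ = builder.add_pair(chrom="chr1", start1=100, start2=300)
>>> tmp_bam = builder.to_path()                       # temporary BAM, sorted and indexed
>>> out = builder.to_path(path=Path("example.bam"))   # file type inferred from the extension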
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Optional[Path]
|
a path at which to write the file, otherwise a temp file is used. |
None
|
index
|
bool
|
if True and the records are coordinate sorted, index the output file |
True
|
pred
|
Callable[[AlignedSegment], bool]
|
optional predicate to specify which reads should be output |
lambda r: True
|
tmp_file_type
|
Optional[SamFileType]
|
the file type to output when a path is not provided (default is BAM) |
None
|
Returns:
| Name | Type | Description |
|---|---|---|
Path |
Path
|
The path to the sorted (and possibly indexed) file. |
Source code in fgpyo/sam/builder.py
Returns the accumulated records in coordinate order.
Source code in fgpyo/sam/builder.py
clipping ¶
Utility Functions for Soft-Clipping records in SAM/BAM Files¶
This module contains utility functions for soft-clipping reads. There are four variants that support clipping the beginnings and ends of reads, and specifying the amount to be clipped in terms of query bases or reference bases:
- softclip_start_of_alignment_by_query() clips the start of the alignment in terms of query bases
- softclip_end_of_alignment_by_query() clips the end of the alignment in terms of query bases
- softclip_start_of_alignment_by_ref() clips the start of the alignment in terms of reference bases
- softclip_end_of_alignment_by_ref() clips the end of the alignment in terms of reference bases
The difference between query and reference based versions is apparent only when there are insertions or deletions in the read as indels have lengths on either the query (insertions) or reference (deletions) but not both.
Upon clipping, a set of additional SAM tags is removed from reads as they are likely invalid.
For example, to clip the last 10 query bases of all records and reduce the qualities to Q2:
>>> from fgpyo.sam import reader, clipping
>>> with reader("./tests/fgpyo/sam/data/valid.sam") as fh:
... for rec in fh:
... before = rec.cigarstring
... info = clipping.softclip_end_of_alignment_by_query(rec, 10, 2)
... after = rec.cigarstring
... print(f"before: {before} after: {after} info: {info}")
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 10M1D10M5I76M after: 10M1D10M5I66M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: None after: None info: ClippingInfo(query_bases_clipped=0, ref_bases_clipped=0)
It should be noted that any clipping potentially makes the common SAM tags NM, MD and UQ invalid, as well as potentially other alignment based SAM tags. Any clipping added to the start of an alignment changes the position (reference_start) of the record. Any reads that have no aligned bases after clipping are set to be unmapped. If writing the clipped reads back to a BAM it should be noted that:
- Mate pairs may have incorrect information about their mate's positions
- Even if the input was coordinate sorted, the output may be out of order
To rectify these problems it is necessary to do the equivalent of:
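The concrete commands are not reproduced on this page; as a hedged sketch, one equivalent approach (the file names are illustrative, and pysam's samtools wrappers are used here only for sorting and indexing) is:

import pysam
from fgpyo import sam

# 1. Queryname-sort so that mates from the same template are adjacent.
pysam.sort("-n", "-o", "clipped.qname.bam", "clipped.bam")

# 2. Re-sync mate information for each primary pair and rewrite the records.
with sam.reader("clipped.qname.bam") as fh:
    with sam.writer("resynced.bam", header=fh.header) as out:
        ...  # pair reads by query name, call fgpyo.sam.set_mate_info(r1, r2), then write both

# 3. Restore coordinate order and index the result.
pysam.sort("-o", "final.bam", "resynced.bam")
pysam.index("final.bam")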
Classes¶
Bases: NamedTuple
Named tuple holding the number of bases clipped on the query and reference respectively.
Source code in fgpyo/sam/clipping.py
Functions¶
softclip_end_of_alignment_by_query(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo
Adds soft-clipping to the end of a read's alignment.
Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
the BAM record to clip |
required |
bases_to_clip
|
int
|
the number of additional bases of clipping desired in the read/query |
required |
clipped_base_quality
|
Optional[int]
|
if not None, set bases in the clipped region to this quality |
None
|
tags_to_invalidate
|
Iterable[str]
|
the set of extended attributes to remove upon clipping |
TAGS_TO_INVALIDATE
|
Returns:
| Name | Type | Description |
|---|---|---|
ClippingInfo |
ClippingInfo
|
a named tuple containing the number of query/read bases and the number of target/reference bases clipped. |
Source code in fgpyo/sam/clipping.py
softclip_end_of_alignment_by_ref(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo
Soft-clips the end of an alignment by bases_to_clip bases on the reference.
Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
the BAM record to clip |
required |
bases_to_clip
|
int
|
the number of additional bases of clipping desired on the reference |
required |
clipped_base_quality
|
Optional[int]
|
if not None, set bases in the clipped region to this quality |
None
|
tags_to_invalidate
|
Iterable[str]
|
the set of extended attributes to remove upon clipping |
TAGS_TO_INVALIDATE
|
Returns:
| Name | Type | Description |
|---|---|---|
ClippingInfo |
ClippingInfo
|
a named tuple containing the number of query/read bases and the number of target/reference bases clipped. |
Source code in fgpyo/sam/clipping.py
softclip_start_of_alignment_by_query(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo
Adds soft-clipping to the start of a read's alignment.
Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
the BAM record to clip |
required |
bases_to_clip
|
int
|
the number of additional bases of clipping desired in the read/query |
required |
clipped_base_quality
|
Optional[int]
|
if not None, set bases in the clipped region to this quality |
None
|
tags_to_invalidate
|
Iterable[str]
|
the set of extended attributes to remove upon clipping |
TAGS_TO_INVALIDATE
|
Returns:
| Name | Type | Description |
|---|---|---|
ClippingInfo |
ClippingInfo
|
a named tuple containing the number of query/read bases and the number of target/reference bases clipped. |
Source code in fgpyo/sam/clipping.py
softclip_start_of_alignment_by_ref(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo
Soft-clips the start of an alignment by bases_to_clip bases on the reference.
Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
the BAM record to clip |
required |
bases_to_clip
|
int
|
the number of additional bases of clipping desired on the reference |
required |
clipped_base_quality
|
Optional[int]
|
if not None, set bases in the clipped region to this quality |
None
|
tags_to_invalidate
|
Iterable[str]
|
the set of extended attributes to remove upon clipping |
TAGS_TO_INVALIDATE
|
Returns:
| Name | Type | Description |
|---|---|---|
ClippingInfo |
ClippingInfo
|
a named tuple containing the number of query/read bases and the number of target/reference bases clipped. |
Source code in fgpyo/sam/clipping.py
sequence ¶
Utility Functions for Manipulating DNA and RNA sequences.¶
This module contains utility functions for manipulating DNA and RNA sequences.
The levenshtein and hamming functions are included for convenience.
If you are performing many distance calculations, a C-based implementation is preferable,
e.g. https://pypi.org/project/Distance/
Functions¶
complement ¶
gc_content ¶
Calculates the fraction of G and C bases in a sequence.
hamming ¶
Calculates the Hamming distance between two strings (case sensitive). The strings must be of equal length.
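For example:
>>> from fgpyo.sequence import hamming
>>> hamming("ACGT", "ACGA")
1
>>> hamming("acgt", "ACGT")  # comparison is case sensitive
4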
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
string1
|
str
|
first string for comparison |
required |
string2
|
str
|
second string for comparison |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If strings are of different lengths. |
Source code in fgpyo/sequence.py
levenshtein ¶
Calculates the Levenshtein distance between two strings (case sensitive).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
string1
|
str
|
first string for comparison |
required |
string2
|
str
|
second string for comparison |
required |
Source code in fgpyo/sequence.py
longest_dinucleotide_run_length ¶
Number of bases in the longest dinucleotide run in a primer.
A dinucleotide run is when two nucleotides are repeated in tandem. For example, TTGG (length = 4) or AACCAACCAA (length = 10). If there are no such runs, returns 0.
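Mirroring the examples above:
>>> from fgpyo.sequence import longest_dinucleotide_run_length
>>> longest_dinucleotide_run_length("TTGG")
4
>>> longest_dinucleotide_run_length("AACCAACCAA")
10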
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
bases
|
str
|
the bases over which to compute |
required |
Return
the number of bases in the longest dinuc repeat (NOT the number of repeat units)
Source code in fgpyo/sequence.py
longest_homopolymer_length ¶
Calculates the length of the longest homopolymer in the input sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
bases
|
str
|
the bases over which to compute |
required |
Return
the length of the longest homopolymer
Source code in fgpyo/sequence.py
longest_hp_length ¶
Calculates the length of the longest homopolymer in the input sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
bases
|
str
|
the bases over which to compute |
required |
Return
the length of the longest homopolymer
Source code in fgpyo/sequence.py
longest_multinucleotide_run_length ¶
Number of bases in the longest multi-nucleotide run.
A multi-nucleotide run is when N nucleotides are repeated in tandem. For example, TTGG (length = 4, N=2) or TAGTAGTAG (length = 9, N = 3). If there are no such runs, returns 0.
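Mirroring the example above:
>>> from fgpyo.sequence import longest_multinucleotide_run_length
>>> longest_multinucleotide_run_length("TAGTAGTAG", repeat_unit_length=3)
9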
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
bases
|
str
|
the bases over which to compute |
required |
repeat_unit_length
|
int
|
the length of the multi-nucleotide repetitive unit (must be > 0) |
required |
Returns:
| Type | Description |
|---|---|
int
|
the number of bases in the longest multinucleotide repeat (NOT the number of repeat units) |
Source code in fgpyo/sequence.py
reverse_complement ¶
Reverse complements a base sequence.
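For example:
>>> from fgpyo.sequence import reverse_complement
>>> reverse_complement("AAGG")
'CCTT'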
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
bases
|
str
|
the bases to be reverse complemented. |
required |
Returns:
| Type | Description |
|---|---|
str
|
the reverse complement of the provided base string |
Source code in fgpyo/sequence.py
util ¶
Modules¶
inspect ¶
Attributes¶
module-attribute
¶TypeAlias for dataclass Fields or attrs Attributes. It will correspond to the correct type for the corresponding _DataclassesOrAttrClass
Functions¶
attr_from(cls: Type[_AttrFromType], kwargs: Dict[str, str], parsers: Optional[Dict[type, Callable[[str], Any]]] = None) -> _AttrFromType
Builds an attr or dataclasses class from key-word arguments
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
cls
|
Type[_AttrFromType]
|
the attr or dataclasses class to be built |
required |
kwargs
|
Dict[str, str]
|
a dictionary of keyword arguments |
required |
parsers
|
Optional[Dict[type, Callable[[str], Any]]]
|
a dictionary of parser functions to apply to specific types |
None
|
Source code in fgpyo/util/inspect.py
dict_parser(cls: Type, type_: TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) -> partial
Returns a function that parses a stringified dict into a Dict of the correct type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
cls
|
Type
|
the type of the class object this is being parsed for (used to get default val for parsers) |
required |
type_
|
TypeAlias
|
the type of the attribute to be parsed |
required |
parsers
|
Optional[Dict[type, Callable[[str], Any]]]
|
an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types) |
None
|
Source code in fgpyo/util/inspect.py
get_fields(cls: Union[_DataclassesOrAttrClass, Type[_DataclassesOrAttrClass]]) -> Tuple[FieldType, ...]
Get the fields tuple from either a dataclasses or attr dataclass (or instance)
Source code in fgpyo/util/inspect.py
get_fields_dict(cls: Union[_DataclassesOrAttrClass, Type[_DataclassesOrAttrClass]]) -> Mapping[str, FieldType]
Get the fields dict from either a dataclasses or attr dataclass (or instance)
Source code in fgpyo/util/inspect.py
list_parser(cls: Type, type_: TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) -> partial
Returns a function that parses a "stringified" list into a List of the correct type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
cls
|
Type
|
the type of the class object this is being parsed for (used to get default val for parsers) |
required |
type_
|
TypeAlias
|
the type of the attribute to be parsed |
required |
parsers
|
Optional[Dict[type, Callable[[str], Any]]]
|
an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types) |
None
|
Source code in fgpyo/util/inspect.py
set_parser(cls: Type, type_: TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) -> partial
Returns a function that parses a stringified set into a Set of the correct type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
cls
|
Type
|
the type of the class object this is being parsed for (used to get default val for parsers) |
required |
type_
|
TypeAlias
|
the type of the attribute to be parsed |
required |
parsers
|
Optional[Dict[type, Callable[[str], Any]]]
|
an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types) |
None
|
Source code in fgpyo/util/inspect.py
split_at_given_level(field: str, split_delim: str = ',', increase_depth_chars: Iterable[str] = ('{', '(', '['), decrease_depth_chars: Iterable[str] = ('}', ')', ']')) -> List[str]
Splits a nested field by its outer-most level
Note that this method may produce incorrect results for fields containing strings with unpaired characters that increase or decrease the depth
Not currently smart enough to deal with fields enclosed in quotes ('' or "") - TODO
Source code in fgpyo/util/inspect.py
tuple_parser(cls: Type, type_: TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) -> partial
Returns a function that parses a stringified tuple into a Tuple of the correct type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
cls
|
Type
|
the type of the class object this is being parsed for (used to get default val for parsers) |
required |
type_
|
TypeAlias
|
the type of the attribute to be parsed |
required |
parsers
|
Optional[Dict[type, Callable[[str], Any]]]
|
an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types) |
None
|
Source code in fgpyo/util/inspect.py
Modules¶
logging ¶
Methods for setting up logging for tools.¶
Progress Logging Examples¶
Frequently input data (SAM/BAM/CRAM/VCF) are iterated in genomic coordinate order. Logging
progress is useful to not only log how many inputs have been consumed, but also their genomic
coordinate. ProgressLogger() can log progress every
fixed number of records. Logging can be written to a logging.Logger as well as a custom print
method.
>>> from fgpyo.util.logging import ProgressLogger
>>> logged_lines = []
>>> progress = ProgressLogger(
... printer=lambda s: logged_lines.append(s),
... verb="recorded",
... noun="items",
... unit=2
... )
>>> progress.record(reference_name="chr1", position=1) # does not log
False
>>> progress.record(reference_name="chr1", position=2) # logs
True
>>> progress.record(reference_name="chr1", position=3) # does not log
False
>>> progress.log_last() # will log the last recorded item, if not previously logged
True
>>> logged_lines # show the lines logged
['recorded 2 items: chr1:2', 'recorded 3 items: chr1:3']
Classes¶
Bases: AbstractContextManager
A little class to track progress.
This will output a log message every unit number times recorded.
Attributes:
| Name | Type | Description |
|---|---|---|
printer |
Callable[[str], Any]
|
either a Logger (in which case progress will be printed at Info) or a lambda that consumes a single string |
noun |
str
|
the noun to use in the log message |
verb |
str
|
the verb to use in the log message |
unit |
int
|
the number of items for every log message |
count |
int
|
the total count of items recorded |
Source code in fgpyo/util/logging.py
Force logging the last record, for example when progress has completed.
Source code in fgpyo/util/logging.py
Record an item at a given genomic coordinate.
Args:
reference_name: the reference name of the item
position: the 1-based start position of the item
Returns:
true if a message was logged, false otherwise
Source code in fgpyo/util/logging.py
Correctly record pysam.AlignedSegments (zero-based coordinates).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
pysam.AlignedSegment object |
required |
Returns:
| Type | Description |
|---|---|
bool
|
true if a message was logged, false otherwise |
Source code in fgpyo/util/logging.py
Correctly record multiple pysam.AlignedSegments (zero-based coordinates).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
recs
|
Iterable[AlignedSegment]
|
pysam.AlignedSegment objects |
required |
Returns:
| Type | Description |
|---|---|
bool
|
true if a message was logged, false otherwise |
Source code in fgpyo/util/logging.py
Functions¶
Globally configure logging for all modules
Configures logging to run at a specific level and output messages to stderr with useful information preceding the actual log message.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
level
|
str
|
the default level for the logger |
'INFO'
|
name
|
str
|
the name of the logger |
'fgpyo'
|
Source code in fgpyo/util/logging.py
metric ¶
Metrics¶
Module for storing, reading, and writing metric-like tab-delimited information.
Metric files are tab-delimited, contain a header, and zero or more rows for metric values. This
makes it easy for them to be read in languages like R. For example, a row per person, with
columns for age, gender, and address.
The Metric() class makes it easy to read, write, and store
one or more metrics of the same type, all the while preserving types for each value in a metric. It is
an abstract base class decorated by
@dataclass, or
@attr.s, with attributes storing one or more
typed values. If using multiple layers of inheritance, keep in mind that it's not possible to mix
these dataclass utils, e.g. a dataclasses class derived from an attr class will not appropriately
initialize the values of the attr superclass.
Examples¶
Defining a new metric class:
>>> from fgpyo.util.metric import Metric
>>> import dataclasses
>>> @dataclasses.dataclass(frozen=True)
... class Person(Metric["Person"]):
... name: str
... age: int
or using attr:
>>> from fgpyo.util.metric import Metric
>>> import attr
>>> from typing import Optional
>>> @attr.s(auto_attribs=True, frozen=True)
... class PersonAttr(Metric["PersonAttr"]):
... name: str
... age: int
... address: Optional[str] = None
Getting the attributes for a metric class. These will be used for the header when reading and writing metric files.
Getting the values from a metric class instance. The values are in the same order as the header.
Writing a list of metrics to a file:
>>> metrics = [
... Person(name="Alice", age=47),
... Person(name="Bob", age=24)
... ]
>>> from pathlib import Path
>>> Person.write(Path("/path/to/metrics.txt"), *metrics)
Then the contents of the written metrics file:
Reading the metrics file back in:
>>> list(Person.read(Path("/path/to/metrics.txt")))
[Person(name='Alice', age=47), Person(name='Bob', age=24)]
Formatting and parsing the values for custom types is supported by overriding the _parsers() and
format_value() methods.
>>> @dataclasses.dataclass(frozen=True)
... class Name:
... first: str
... last: str
... @classmethod
... def parse(cls, value: str) -> "Name":
... fields = value.split(" ")
... return Name(first=fields[0], last=fields[1])
>>> from typing import Dict, Callable, Any
>>> @dataclasses.dataclass(frozen=True)
... class PersonWithName(Metric["PersonWithName"]):
... name: Name
... age: int
... @classmethod
... def _parsers(cls) -> Dict[type, Callable[[str], Any]]:
... return {Name: lambda value: Name.parse(value=value)}
... @classmethod
... def format_value(cls, value: Any) -> str:
... if isinstance(value, Name):
... return f"{value.first} {value.last}"
... else:
... return super().format_value(value=value)
>>> PersonWithName.parse(fields=["john doe", "42"])
PersonWithName(name=Name(first='john', last='doe'), age=42)
>>> PersonWithName(name=Name(first='john', last='doe'), age=42).formatted_values()
['john doe', '42']
Classes¶
Bases: ABC, Generic[MetricType]
Abstract base class for all metric-like tab-delimited files
Metric files are tab-delimited, contain a header, and zero or more rows for metric values. This
makes it easy for them to be read in languages like R.
Subclasses of Metric() can support parsing and
formatting custom types with _parsers() and
format_value().
Source code in fgpyo/util/metric.py
classmethod
¶The default method to format values of a given type.
By default, this method will comma-delimit list, tuple, and set types, and apply
str to all others.
Dictionaries / mappings will have keys and vals separated by semicolons, and key val pairs delimited by commas.
In addition, lists will be flanked with '[]', tuples with '()' and sets and dictionaries with '{}'
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
value
|
Any
|
the value to format. |
required |
Source code in fgpyo/util/metric.py
An iterator over formatted attribute values in the same order as the header.
An iterator over formatted attribute values in the same order as the header.
classmethod
¶An iterator over field names and their corresponding values in the same order as the header.
Source code in fgpyo/util/metric.py
classmethod
¶An iterator over field names in the same order as the header.
classmethod
¶Parses the string-representation of this metric. One string per attribute should be given.
Source code in fgpyo/util/metric.py
classmethod
¶read(path: Path, ignore_extra_fields: bool = True, strip_whitespace: bool = False, threads: Optional[int] = None) -> Iterator[Any]
Reads in zero or more metrics from the given path.
The metric file must contain a matching header.
Columns that are not present in the file but are optional in the metric class will be set to their default values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
the path to the metrics file. |
required |
ignore_extra_fields
|
bool
|
True to ignore any extra columns, False to raise an exception. |
True
|
strip_whitespace
|
bool
|
True to strip leading and trailing whitespace from each field, False to keep as-is. |
False
|
threads
|
Optional[int]
|
the number of threads to use when decompressing gzip files |
None
|
Source code in fgpyo/util/metric.py
classmethod
¶An iterator over attribute values in the same order as the header.
classmethod
¶Writes zero or more metrics to the given path.
The header will always be written.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
Path to the output file. |
required |
values
|
MetricType
|
Zero or more metrics. |
()
|
threads
|
Optional[int]
|
the number of threads to use when compressing gzip files |
None
|
Source code in fgpyo/util/metric.py
dataclass
¶Header of a file.
A file's header contains an optional preamble, consisting of lines prefixed by a comment character and/or empty lines, and a required row of fieldnames before the data rows begin.
Attributes:
| Name | Type | Description |
|---|---|---|
preamble |
List[str]
|
A list of any lines preceding the fieldnames. |
fieldnames |
List[str]
|
The field names specified in the final line of the header. |
Source code in fgpyo/util/metric.py
Bases: Generic[MetricType], AbstractContextManager
Source code in fgpyo/util/metric.py
__init__(filename: Union[Path, str], metric_class: Type[Metric], append: bool = False, delimiter: str = '\t', include_fields: Optional[List[str]] = None, exclude_fields: Optional[List[str]] = None, lineterminator: str = '\n', threads: Optional[int] = None) -> None
Args:
filename: Path to the file to write.
metric_class: Metric class.
append: If `True`, the file will be appended to. Otherwise, the specified file will be
overwritten.
delimiter: The output file delimiter.
include_fields: If specified, only the listed fieldnames will be included when writing
records to file. Fields will be written in the order provided.
May not be used together with `exclude_fields`.
exclude_fields: If specified, any listed fieldnames will be excluded when writing
records to file.
May not be used together with `include_fields`.
lineterminator: The string used to terminate lines produced by the MetricWriter.
Default = "
". threads: the number of threads to use when compressing gzip files
Raises:
TypeError: If the provided metric class is not a dataclass- or attr-decorated
subclass of `Metric`.
AssertionError: If the provided filepath is not writable.
AssertionError: If `append=True` and the provided file is not readable. (When appending,
we check to ensure that the header matches the specified metric class. The file must
be readable to get the header.)
ValueError: If `append=True` and the provided file is a FIFO (named pipe).
ValueError: If `append=True` and the provided file does not include a header.
ValueError: If `append=True` and the header of the provided file does not match the
specified metric class and the specified include/exclude fields.
Source code in fgpyo/util/metric.py
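A hedged usage sketch (reusing the Person metric defined in the module examples above; the output path is illustrative):
>>> from pathlib import Path
>>> from fgpyo.util.metric import MetricWriter
>>> with MetricWriter(filename=Path("people.txt"), metric_class=Person) as writer:
...     writer.write(Person(name="Alice", age=47))
...     writer.write_all([Person(name="Bob", age=24)])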
Write a single Metric instance to file.
The Metric is converted to a dictionary and then written using the underlying
csv.DictWriter. If the MetricWriter was created using the include_fields or
exclude_fields arguments, the fields of the Metric are subset and/or reordered
accordingly before writing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
metric
|
MetricType
|
An instance of the specified Metric. |
required |
Raises:
| Type | Description |
|---|---|
TypeError
|
If the provided |
Source code in fgpyo/util/metric.py
Write multiple Metric instances to file.
Each Metric is converted to a dictionary and then written using the underlying
csv.DictWriter. If the MetricWriter was created using the include_fields or
exclude_fields arguments, the attributes of each Metric are subset and/or reordered
accordingly before writing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
metrics
|
Iterable[MetricType]
|
A sequence of instances of the specified Metric. |
required |
Source code in fgpyo/util/metric.py
Modules¶
string ¶
Functions¶
A simple version of Unix's column utility. This assumes the table is NxM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rows
|
List[List[str]]
|
the rows to adjust. Each row must have the same number of delimited fields. |
required |
delimiter
|
str
|
the delimiter for each field in a row. |
' '
|
Source code in fgpyo/util/string.py
types ¶
Attributes¶
module-attribute
¶A function parameter's type annotation may be any of the following:
1) type, when declaring any of the built-in Python types
2) typing._GenericAlias, when declaring generic collection types or union types using pre-PEP 585 and pre-PEP 604 syntax (e.g. List[int], Optional[int], or Union[int, None])
3) types.UnionType, when declaring union types using PEP 604 syntax (e.g. int | None)
4) types.GenericAlias, when declaring generic collection types using PEP 585 syntax (e.g. list[int])
types.GenericAlias is a subclass of type, but typing._GenericAlias and types.UnionType are
not and must be considered explicitly.
Functions¶
Returns true if the provided type can be constructed from a string
Source code in fgpyo/util/types.py
make_literal_parser(literal: Type[LiteralType], parsers: Iterable[Callable[[str], LiteralType]]) -> partial
Generates a parser function for a literal type object and a set of parsers for the possible parsers to that literal type object
Source code in fgpyo/util/types.py
Generates a parser function for a union type object and set of parsers for the possible parsers to that union type object
Source code in fgpyo/util/types.py
Returns None if the value is 'None', else raises an error
Parses strings into bools accounting for the many different text representations of bools that can be used
Source code in fgpyo/util/types.py
vcf ¶
Classes for generating VCF and records for testing¶
This module contains utility classes for the generation of VCF files and variant records, for use in testing.
The module contains the following public classes:
VariantBuilder()-- A builder class that allows the accumulation of variant records and access as a list and writing to file.
Examples¶
Typically, we have pysam.VariantRecord records obtained from reading
from a VCF file. The VariantBuilder() class builds
such records.
Variants are added with the add() method,
which returns a pysam.VariantRecord.
>>> import pysam
>>> from fgpyo.vcf.builder import VariantBuilder
>>> builder: VariantBuilder = VariantBuilder()
>>> new_record_1: pysam.VariantRecord = builder.add() # uses the defaults
>>> new_record_2: pysam.VariantRecord = builder.add(
... contig="chr2", pos=1001, id="rs1234", ref="C", alts=["T"],
... qual=40, filter=["PASS"]
... )
VariantBuilder can create sites-only, single-sample, or multi-sample VCF files. If not producing a sites-only VCF file, VariantBuilder must be created by passing a list of sample IDs
>>> builder: VariantBuilder = VariantBuilder(sample_ids=["sample1", "sample2"])
>>> new_record_1: pysam.VariantRecord = builder.add() # uses the defaults
>>> new_record_2: pysam.VariantRecord = builder.add(
... samples={"sample1": {"GT": "0|1"}, "sample2": {"GT": "0|0"}}
... )
The variants stored in the builder can be retrieved as a coordinate sorted VCF file via the
to_path() method:
The variants may also be retrieved in the order they were added via the
to_unsorted_list() method and
in coordinate sorted order via the
to_sorted_list() method.
Functions¶
reader ¶
Opens the given path for VCF reading
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
VcfPath
|
the path to a VCF, or an open file handle |
required |
Source code in fgpyo/vcf/__init__.py
writer ¶
Opens the given path for VCF writing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
VcfPath
|
the path to a VCF, or an open filehandle |
required |
header
|
VariantHeader
|
the source for the output VCF header. If you are modifying a VCF file that you are reading from, you can pass reader.header |
required |
Source code in fgpyo/vcf/__init__.py
Modules¶
builder ¶
Classes for generating VCF and records for testing¶
Classes¶
Builder for constructing one or more variant records (pysam.VariantRecord) for a VCF. The VCF can be sites-only, single-sample, or multi-sample.
Provides the ability to manufacture variants from minimal arguments, while generating any remaining attributes to ensure a valid variant.
A builder is constructed with a handful of defaults including the sample name and sequence dictionary. If the VCF will not be sites-only, the list of sample IDs ("sample_ids") must be provided to the VariantBuilder constructor.
Variants are then added using the add()
method.
Once accumulated the variants can be accessed in the order in which they were created through
the to_unsorted_list()
function, or in a list sorted by coordinate order via
to_sorted_list(). Lastly, the
records can be written to a temporary file using
to_path().
Attributes:
| Name | Type | Description |
|---|---|---|
sample_ids |
List[str]
|
the sample name(s) |
sd |
Dict[str, Dict[str, Any]]
|
sequence dictionary, implemented as python dict from contig name to dictionary with contig properties. At a minimum, each contig dict in sd must contain "ID" (the same as contig_name) and "length", the contig length. Other values will be added to the VCF header line for that contig. |
seq_idx_lookup |
Dict[str, int]
|
dictionary mapping contig name to index of contig in sd |
records |
List[VariantRecord]
|
the list of variant records |
header |
VariantHeader
|
the pysam header |
Source code in fgpyo/vcf/builder.py
__init__(sample_ids: Optional[Iterable[str]] = None, sd: Optional[Dict[str, Dict[str, Any]]] = None) -> None
Initializes a new VariantBuilder for generating variants and VCF files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sample_ids | Optional[Iterable[str]] | the name of the sample(s) | None |
| sd | Optional[Dict[str, Dict[str, Any]]] | optional sequence dictionary | None |
Source code in fgpyo/vcf/builder.py
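If the default sequence dictionary is not appropriate, a custom one can be passed in the shape described for the sd attribute above (each contig entry needs at least "ID" and "length"). The contig names and lengths in this sketch are hypothetical.

```python
from fgpyo.vcf.builder import VariantBuilder

# Hypothetical two-contig sequence dictionary.
sd = {
    "ctg1": {"ID": "ctg1", "length": 10_000},
    "ctg2": {"ID": "ctg2", "length": 5_000},
}
builder = VariantBuilder(sample_ids=["sample1"], sd=sd)
builder.add(contig="ctg2", pos=42, ref="T", alts="G")
```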
add(contig: Optional[str] = None, pos: int = 1000, end: Optional[int] = None, id: str = '.', ref: str = 'A', alts: Union[None, str, Iterable[str]] = ('.',), qual: int = 60, filter: Union[None, str, Iterable[str]] = None, info: Optional[Dict[str, Any]] = None, samples: Optional[Dict[str, Dict[str, Any]]] = None) -> VariantRecord
Generates a new variant and adds it to the internal collection.
Notes:
* Very little validation is done with respect to INFO and FORMAT keys being defined in the header.
* VCFs are 1-based, but pysam is (mostly) 0-based. We define the function in terms of the VCF property "pos", which is 1-based. pysam will also report "pos" as 1-based, so that is the property that should be accessed when using the records produced by this function (not "start").
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| contig | Optional[str] | the chromosome name. If None, will use the first contig in the sequence dictionary. | None |
| pos | int | the 1-based position of the variant | 1000 |
| end | Optional[int] | an optional 1-based inclusive END position; if not specified a value will be looked for in info["END"], or calculated from the length of the reference allele | None |
| id | str | the variant id | '.' |
| ref | str | the reference allele | 'A' |
| alts | Union[None, str, Iterable[str]] | the list of alternate alleles, None if no alternates. If a single string is passed, that will be used as the only alt. | ('.',) |
| qual | int | the variant quality | 60 |
| filter | Union[None, str, Iterable[str]] | the list of filters, None if no filters (ex. PASS). If a single string is passed, that will be used as the only filter. | None |
| info | Optional[Dict[str, Any]] | the dictionary of INFO key-value pairs | None |
| samples | Optional[Dict[str, Dict[str, Any]]] | the dictionary from sample name to FORMAT key-value pairs. If a sample property is supplied for any sample but omitted in some, it will be set to missing (".") for samples that don't have that property explicitly assigned. If a sample in the VCF is omitted, all its properties will be set to missing. | None |
Source code in fgpyo/vcf/builder.py
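A sketch of a multi-sample call illustrating the samples behavior described above. It assumes the GT FORMAT field is already defined in the builder's default header; the sample names, position, and genotype are illustrative.

```python
from fgpyo.vcf.builder import VariantBuilder

builder = VariantBuilder(sample_ids=["sampleA", "sampleB"])

# GT is assumed to be defined in the default header. sampleB is omitted from
# `samples`, so its properties are set to missing (".") as documented above.
builder.add(
    contig="chr1",
    pos=12345,
    ref="A",
    alts=("C", "G"),
    samples={"sampleA": {"GT": (0, 1)}},
)
```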
add_filter_header(name: str, description: Optional[str] = None) -> None
Add a FILTER header field to the VCF header.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | the name of the field | required |
| description | Optional[str] | the description of the field | None |
Source code in fgpyo/vcf/builder.py
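For example, a filter can be declared in the header and then applied via the filter argument of add(); the filter name, description, and quality threshold below are hypothetical.

```python
from fgpyo.vcf.builder import VariantBuilder

builder = VariantBuilder(sample_ids=["sample1"])

# Hypothetical FILTER field, declared before it is referenced in add().
builder.add_filter_header(name="LowQual", description="Variant quality below 30")
builder.add(contig="chr1", pos=100, ref="A", alts="T", qual=10, filter="LowQual")
```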
add_format_header(name: str, field_type: VcfFieldType, number: Union[int, VcfFieldNumber] = NUM_GENOTYPES, description: Optional[str] = None) -> None
Add a FORMAT header field to the VCF header.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | the name of the field | required |
| field_type | VcfFieldType | the field_type of the field | required |
| number | Union[int, VcfFieldNumber] | the number of the field | NUM_GENOTYPES |
| description | Optional[str] | the description of the field | None |
Source code in fgpyo/vcf/builder.py
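A sketch of declaring a per-sample FORMAT field and then populating it through add(); the field name and value are hypothetical, and INTEGER is assumed to be a member of VcfFieldType.

```python
from fgpyo.vcf.builder import VariantBuilder, VcfFieldType

builder = VariantBuilder(sample_ids=["sample1"])

# Hypothetical read-depth FORMAT field with one value per sample.
builder.add_format_header("DP", field_type=VcfFieldType.INTEGER, number=1, description="Read depth")
builder.add(contig="chr1", pos=200, ref="C", alts="A", samples={"sample1": {"DP": 42}})
```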
add_info_header(name: str, field_type: VcfFieldType, number: Union[int, VcfFieldNumber] = 1, description: Optional[str] = None, source: Optional[str] = None, version: Optional[str] = None) -> None
Add an INFO header field to the VCF header.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | the name of the field | required |
| field_type | VcfFieldType | the field_type of the field | required |
| number | Union[int, VcfFieldNumber] | the number of the field | 1 |
| description | Optional[str] | the description of the field | None |
| source | Optional[str] | the source of the field | None |
| version | Optional[str] | the version of the field | None |
Source code in fgpyo/vcf/builder.py
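Similarly, a site-level INFO field can be declared and then set through the info argument of add(); the field name and value are hypothetical, and FLOAT is assumed to be a member of VcfFieldType.

```python
from fgpyo.vcf.builder import VariantBuilder, VcfFieldType

# A sites-only builder (no sample_ids).
builder = VariantBuilder()

# Hypothetical allele-frequency INFO field with a single value per site.
builder.add_info_header("AF", field_type=VcfFieldType.FLOAT, number=1, description="Allele frequency")
builder.add(contig="chr1", pos=300, ref="G", alts="A", info={"AF": 0.25})
```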
classmethod ¶
Generates the sequence dictionary that is used by default by VariantBuilder. Re-uses the dictionary from SamBuilder for consistency.
Returns:
| Type | Description |
|---|---|
| Dict[str, Dict[str, Any]] | A new copy of the sequence dictionary as a map of contig name to dictionary, one per contig. |
Source code in fgpyo/vcf/builder.py
to_path(path: Optional[Path] = None) -> Path
Returns a path to a VCF for variants added to this builder.
If the path given ends in ".gz" then the generated file will be bgzipped and a tabix index generated for the file with the suffix ".gz.tbi".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Optional[Path] | optional path to the VCF | None |
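A sketch of both uses: with no argument the records go to a temporary file, while a path ending in ".gz" additionally produces a tabix index. The explicit output path is hypothetical.

```python
from pathlib import Path
from fgpyo.vcf.builder import VariantBuilder

builder = VariantBuilder(sample_ids=["sample1"])
builder.add(contig="chr1", pos=100, ref="A", alts="C")

# With no argument the variants are written to a temporary VCF.
tmp_vcf = builder.to_path()

# Hypothetical explicit output; a ".gz" suffix yields a bgzipped VCF plus a
# ".gz.tbi" tabix index alongside it.
gz_vcf = builder.to_path(path=Path("variants.vcf.gz"))
```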