fasta
Modules¶
builder ¶
Classes for generating fasta files and records for testing¶
This module contains utility classes for creating fasta files, indexed fasta files (.fai), and sequence dictionaries (.dict).
Examples of creating sets of contigs for writing to fasta¶
Writing a FASTA with two contigs each with 100 bases:
>>> from pathlib import Path
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> builder.add("chr10").add("AAAAAAAAAA", 10)
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> builder.add("chr11").add("GGGGGGGGGG", 10)
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> fasta_path = Path(getfixture("tmp_path")) / "test.fasta"
>>> builder.to_file(path=fasta_path)
Writing a FASTA with one contig with 100 A's and 50 T's:
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> builder.add("chr10").add("AAAAAAAAAA", 10).add("TTTTTTTTTT", 5)
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> builder.to_file(path=fasta_path)
Add bases to existing contig:
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> contig_one = builder.add("chr10").add("AAAAAAAAAA", 1)
>>> contig_one.add("NNN", 1)
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> contig_one.bases
'AAAAAAAAAANNN'
Classes¶
ContigBuilder ¶
Builder for constructing new contigs and adding bases to existing contigs. Existing contigs cannot be overwritten; each contig name in FastaBuilder must be unique. Instances of ContigBuilder should be created using FastaBuilder.add(), where assembly and species are optional parameters that default to FastaBuilder.assembly and FastaBuilder.species.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | | Unique contig ID, e.g., "chr10" |
| assembly | | Assembly information, if None default is 'testassembly' |
| species | | Species information, if None default is 'testspecies' |
| bases | | The bases to be added to the contig, e.g., "A" |
Source code in fgpyo/fasta/builder.py
Functions¶
add(bases: str, times: int = 1) -> ContigBuilder
Method for adding bases to a new or existing instance of ContigBuilder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| bases | str | The bases to be added to the contig | required |
| times | int | The number of times the bases should be repeated | 1 |
Example add("AAA", 2) results in the following bases -> "AAAAAA"
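The repetition and chaining behaviour described above can be illustrated with a small stand-in class. This is only a sketch: `SketchContig` is a hypothetical name, not fgpyo's ContigBuilder, and it assumes that `add` simply appends `bases * times` and returns the builder so that calls can be chained.

```python
class SketchContig:
    """Hypothetical stand-in for a contig builder (not fgpyo's implementation)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.bases = ""

    def add(self, bases: str, times: int = 1) -> "SketchContig":
        # Append the bases repeated `times` times, then return self so calls chain.
        self.bases += bases * times
        return self


contig = SketchContig("chr10").add("AAA", 2).add("T", 3)
print(contig.bases)  # AAAAAATTT
```

Returning the builder from `add` is what makes the one-line `builder.add("chr10").add(...)` style in the examples possible.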
Source code in fgpyo/fasta/builder.py
FastaBuilder ¶
Builder for constructing sets of one or more contigs.
Provides the ability to manufacture sets of contigs from minimal input, and automatically generates the information necessary for writing the FASTA file, index, and dictionary.
A builder is constructed from an assembly, species, and line length. All attributes have defaults, which can be overridden.
Contigs are added to FastaBuilder using:
FastaBuilder.add()
Bases are added to existing contigs using:
ContigBuilder.add()
Once accumulated, the contigs can be written to a file using:
to_file()
Calling to_file() will also generate the fasta index (.fai) and sequence dictionary (.dict).
Attributes:
| Name | Type | Description |
|---|---|---|
| assembly | str | Assembly information, if None default is 'testassembly' |
| species | str | Species, if None default is 'testspecies' |
| line_length | int | Desired line length, if None default is 80 |
| contig_builders | Dict[str, ContigBuilder] | Private dictionary of contig names and instances of ContigBuilder |
Source code in fgpyo/fasta/builder.py
Functions¶
__getitem__(key: str) -> ContigBuilder
add(name: str, assembly: Optional[str] = None, species: Optional[str] = None) -> ContigBuilder
Creates and returns a new ContigBuilder for a contig with the provided name. Contig names must be unique; attempting to create two separate contigs with the same name will result in an error.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Unique contig ID, e.g., "chr10" | required |
| assembly | Optional[str] | Assembly information, if None default is 'testassembly' | None |
| species | Optional[str] | Species information, if None default is 'testspecies' | None |
Source code in fgpyo/fasta/builder.py
to_file(path: Path) -> None
Writes out the set of accumulated contigs to a FASTA file at the path given.
Also generates the accompanying fasta index file (.fa.fai) and sequence
dictionary file (.dict).
Contigs are emitted in the order they were added to the builder. Sequence lines in the FASTA file are wrapped to the line length given when the builder was constructed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Path to write files to. | required |
Example: FastaBuilder.to_file(path = pathlib.Path("my_fasta.fa"))
Source code in fgpyo/fasta/builder.py
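The ordering and line-wrapping behaviour described for to_file() can be sketched in plain Python. This is an illustration under stated assumptions, not fgpyo's code: `format_fasta` is a hypothetical helper, and the header line is assumed to be just `>` followed by the contig name.

```python
from typing import Dict, List


def format_fasta(contigs: Dict[str, str], line_length: int = 80) -> str:
    """Render contigs as FASTA text, wrapping sequence lines at `line_length`."""
    lines: List[str] = []
    # Dicts preserve insertion order, so contigs are emitted in the order added.
    for name, bases in contigs.items():
        lines.append(f">{name}")
        # Slice the sequence into chunks of at most `line_length` bases.
        for start in range(0, len(bases), line_length):
            lines.append(bases[start:start + line_length])
    return "\n".join(lines) + "\n"


text = format_fasta({"chr10": "A" * 100, "chr11": "G" * 50}, line_length=80)
print(text)
```

With a line length of 80, the 100-base contig occupies two sequence lines (80 and 20 bases) and the 50-base contig occupies one.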
Functions¶
pysam_dict ¶
Calls pysam.dict and writes the sequence dictionary to the provided output path
Args:
- assembly: Assembly
- species: Species
- output_path: File path to write dictionary to
- input_path: Path to fasta file
Source code in fgpyo/fasta/builder.py
pysam_faidx ¶
Calls pysam.faidx and writes fasta index in the same file location as the fasta file
Args:
- input_path: Path to fasta file
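For readers unfamiliar with what a fasta index holds, the following sketch computes the five columns of a .fai record (name, length, byte offset of the first base, bases per line, bytes per line) from FASTA text. It is not how pysam builds the index; `fai_records` is a hypothetical helper that assumes ASCII text, "\n" line endings, and uniformly wrapped lines.

```python
from typing import List, Optional, Tuple

FaiRecord = Tuple[str, int, int, int, int]


def fai_records(fasta_text: str) -> List[FaiRecord]:
    """Compute (name, length, offset, line_bases, line_width) for each contig."""
    records: List[FaiRecord] = []
    offset = 0  # offset of the current line from the start of the text
    name: Optional[str] = None
    length = seq_offset = line_bases = line_width = 0
    for line in fasta_text.splitlines(keepends=True):
        if line.startswith(">"):
            if name is not None:
                records.append((name, length, seq_offset, line_bases, line_width))
            # The contig name is the first whitespace-delimited token of the header.
            name = line[1:].split()[0]
            length = line_bases = line_width = 0
            seq_offset = offset + len(line)  # the first base follows the header line
        else:
            bases = line.rstrip("\n")
            if line_bases == 0:
                # The first sequence line defines the wrapping for the contig.
                line_bases, line_width = len(bases), len(line)
            length += len(bases)
        offset += len(line)
    if name is not None:
        records.append((name, length, seq_offset, line_bases, line_width))
    return records


records = fai_records(">chr10\nAAAAAAAAAA\nAAAAA\n>chr11\nGGGG\n")
for record in records:
    print("\t".join(str(field) for field in record))
```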
sequence_dictionary ¶
Classes for representing sequence dictionaries.¶
Examples of building and using sequence dictionaries¶
Building a sequence dictionary from a pysam.AlignmentHeader:
>>> import pysam
>>> from fgpyo.fasta.sequence_dictionary import SequenceDictionary
>>> sd: SequenceDictionary
>>> with pysam.AlignmentFile("./tests/fgpyo/sam/data/valid.sam") as fh:
... sd = SequenceDictionary.from_sam(fh.header)
>>> print(sd)
@SQ SN:chr1 LN:101
@SQ SN:chr2 LN:101
@SQ SN:chr3 LN:101
@SQ SN:chr4 LN:101
@SQ SN:chr5 LN:101
@SQ SN:chr6 LN:101
@SQ SN:chr7 LN:404
@SQ SN:chr8 LN:202
Query based on index:
>>> print(sd[3])
@SQ SN:chr4 LN:101
Query based on name:
>>> print(sd["chr6"])
@SQ SN:chr6 LN:101
Add, get, and delete attributes:
>>> from fgpyo.fasta.sequence_dictionary import Keys
>>> meta = sd[0]
>>> print(meta)
@SQ SN:chr1 LN:101
>>> meta[Keys.ASSEMBLY] = "hg38"
>>> print(meta)
@SQ SN:chr1 LN:101 AS:hg38
>>> meta.get(Keys.ASSEMBLY)
'hg38'
>>> meta.get(Keys.SPECIES) is None
True
>>> Keys.MD5 in meta
False
>>> del meta[Keys.ASSEMBLY]
>>> print(meta)
@SQ SN:chr1 LN:101
Get a sequence based on one of its aliases:
>>> meta[Keys.ALIASES] = "foo,bar,car"
>>> sd = SequenceDictionary(infos=[meta] + sd.infos[1:])
>>> print(sd)
@SQ SN:chr1 LN:101 AN:foo,bar,car
@SQ SN:chr2 LN:101
@SQ SN:chr3 LN:101
@SQ SN:chr4 LN:101
@SQ SN:chr5 LN:101
@SQ SN:chr6 LN:101
@SQ SN:chr7 LN:404
@SQ SN:chr8 LN:202
>>> print(sd["chr1"])
@SQ SN:chr1 LN:101 AN:foo,bar,car
>>> print(sd["bar"])
@SQ SN:chr1 LN:101 AN:foo,bar,car
Create a pysam.AlignmentHeader from a sequence dictionary:
>>> sd.to_sam_header()
<pysam.libcalignmentfile.AlignmentHeader object at ...>
>>> print(sd.to_sam_header())
@HD VN:1.5
@SQ SN:chr1 LN:101 AN:foo,bar,car
@SQ SN:chr2 LN:101
@SQ SN:chr3 LN:101
@SQ SN:chr4 LN:101
@SQ SN:chr5 LN:101
@SQ SN:chr6 LN:101
@SQ SN:chr7 LN:404
@SQ SN:chr8 LN:202
Create a pysam.AlignmentHeader from a sequence dictionary with extra header items:
>>> sd.to_sam_header(
... extra_header={"RG": [{"ID": "A", "LB": "a-library"}, {"ID": "B", "LB": "b-library"}]}
... )
<pysam.libcalignmentfile.AlignmentHeader object at ...>
>>> print(sd.to_sam_header(
... extra_header={"RG": [{"ID": "A", "LB": "a-library"}, {"ID": "B", "LB": "b-library"}]}
... ))
@HD VN:1.5
@SQ SN:chr1 LN:101 AN:foo,bar,car
@SQ SN:chr2 LN:101
@SQ SN:chr3 LN:101
@SQ SN:chr4 LN:101
@SQ SN:chr5 LN:101
@SQ SN:chr6 LN:101
@SQ SN:chr7 LN:404
@SQ SN:chr8 LN:202
@RG ID:A LB:a-library
@RG ID:B LB:b-library
Attributes¶
SEQUENCE_NAME_PATTERN
module-attribute
¶
SEQUENCE_NAME_PATTERN: Pattern = compile('^[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*$')
Regular expression for valid reference sequence names according to the SAM spec
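As a quick illustration of what the pattern accepts, the sketch below recompiles the same expression with the standard library and checks a few names. `is_valid_sequence_name` is a hypothetical helper introduced only for this example.

```python
import re

# Pattern copied from SEQUENCE_NAME_PATTERN above.
SEQUENCE_NAME_PATTERN = re.compile(
    r"^[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*$"
)


def is_valid_sequence_name(name: str) -> bool:
    """Hypothetical helper: True if `name` matches the pattern."""
    return SEQUENCE_NAME_PATTERN.match(name) is not None


print(is_valid_sequence_name("chr1"))      # True
print(is_valid_sequence_name("HLA-A*01"))  # True: '*' is allowed after the first character
print(is_valid_sequence_name("*chr1"))     # False: '*' may not start a name
print(is_valid_sequence_name("chr 1"))     # False: spaces are not allowed
```

The two character classes differ only in `*` and `=`, which are permitted anywhere except the first character.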
Classes¶
AlternateLocus
dataclass
¶
Stores an alternate locus for an associated sequence (1-based inclusive)
Source code in fgpyo/fasta/sequence_dictionary.py
Functions¶
Any post initialization validation should go here
Source code in fgpyo/fasta/sequence_dictionary.py
staticmethod
parse(value: str) -> AlternateLocus
Parse the genomic interval of format: <contig>:<start>-<end>
Source code in fgpyo/fasta/sequence_dictionary.py
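A minimal sketch of parsing the `<contig>:<start>-<end>` format follows. It is not fgpyo's implementation: `Locus` and `parse_locus` are hypothetical names, and the validation rules (start at least 1, end not before start) are assumptions based on the 1-based inclusive convention stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    """Hypothetical stand-in for an alternate locus (1-based inclusive)."""

    contig: str
    start: int
    end: int


def parse_locus(value: str) -> Locus:
    # Split on the last ':' because contig names may themselves contain ':' or '-'.
    contig, sep, span = value.rpartition(":")
    start_text, dash, end_text = span.partition("-")
    if not contig or not sep or not dash:
        raise ValueError(f"Expected <contig>:<start>-<end>, got: {value!r}")
    start, end = int(start_text), int(end_text)  # int() raises ValueError on bad input
    if start < 1 or end < start:
        raise ValueError(f"Invalid 1-based inclusive interval: {value!r}")
    return Locus(contig=contig, start=start, end=end)


print(parse_locus("chr1:100-200"))
```

Splitting from the right matters because the sequence-name pattern allows ':' and '-' inside contig names.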
Keys ¶
Bases: StrEnum
Enumeration of tags/attributes available on a sequence record/metadata (SAM @SQ line).
Source code in fgpyo/fasta/sequence_dictionary.py
SequenceDictionary
dataclass
¶
Bases: Mapping[Union[str, int], SequenceMetadata]
Contains an ordered collection of sequences.
A specific SequenceMetadata may be retrieved by name (str) or index (int), either by
using the generic get method or by the correspondingly named by_name and by_index methods.
The latter methods provide faster retrieval when the type is known.
This mapping collection iterates over the keys. To iterate over each SequenceMetadata,
either use the typical values() method or access the metadata directly with infos.
Attributes:
| Name | Type | Description |
|---|---|---|
| infos | List[SequenceMetadata] | the ordered collection of sequence metadata |
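The lookup-by-name-or-index and iterate-over-keys behaviour can be sketched with a small stand-in built on the standard library. `SketchDictionary` is hypothetical and stores plain `(name, length)` tuples rather than SequenceMetadata; it only illustrates the mapping semantics described above.

```python
from collections.abc import Mapping
from typing import Dict, Iterator, List, Tuple, Union


class SketchDictionary(Mapping):
    """Hypothetical ordered collection of (name, length) pairs, keyed by name or index."""

    def __init__(self, names: List[str], lengths: List[int]) -> None:
        self.infos: List[Tuple[str, int]] = list(zip(names, lengths))
        self._by_name: Dict[str, int] = {name: i for i, name in enumerate(names)}

    def __getitem__(self, key: Union[str, int]) -> Tuple[str, int]:
        # Dispatch on key type: an int is a position, a str is a name.
        if isinstance(key, int):
            return self.infos[key]
        return self.infos[self._by_name[key]]

    def __iter__(self) -> Iterator[str]:
        # Iterating a mapping yields its keys (here, the names), not the records.
        return iter(self._by_name)

    def __len__(self) -> int:
        return len(self.infos)


sd = SketchDictionary(["chr1", "chr2"], [101, 202])
print(sd["chr2"], sd[1])   # the same record, reached by name and by index
print(list(sd))            # keys
print(list(sd.values()))   # records
```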
Source code in fgpyo/fasta/sequence_dictionary.py
Functions¶
by_index(index: int) -> SequenceMetadata
Gets a SequenceMetadata explicitly by index. Raises an IndexError
if the index is out of bounds.
by_name(name: str) -> SequenceMetadata
staticmethod
from_sam(data: Path) -> SequenceDictionary
from_sam(data: AlignmentFile) -> SequenceDictionary
from_sam(data: AlignmentHeader) -> SequenceDictionary
from_sam(data: List[Dict[str, Any]]) -> SequenceDictionary
from_sam(data: Union[Path, AlignmentFile, AlignmentHeader, List[Dict[str, Any]]]) -> SequenceDictionary
Creates a SequenceDictionary from a SAM file or its header.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
Union[Path, AlignmentFile, AlignmentHeader, List[Dict[str, Any]]]
|
The input may be any of:
- a path to a SAM file
- an open |
required |
Returns:
A SequenceDictionary mapping reference names to their metadata.
Source code in fgpyo/fasta/sequence_dictionary.py
get_by_name(name: str) -> Optional[SequenceMetadata]
Gets a SequenceMetadata explicitly by name. Returns None if
the name does not exist in this dictionary
same_as(other: SequenceDictionary) -> bool
Returns true if the sequences share a common reference name (including aliases), have the same length, and the same MD5 if both have MD5s
Source code in fgpyo/fasta/sequence_dictionary.py
to_sam_header(extra_header: Optional[Dict[str, Any]] = None) -> AlignmentHeader
Converts the sequence dictionary to a pysam.AlignmentHeader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
extra_header
|
Optional[Dict[str, Any]]
|
a dictionary of extra values to add to the header, None otherwise. See
|
None
|
Source code in fgpyo/fasta/sequence_dictionary.py
SequenceMetadata
dataclass
¶
Bases: MutableMapping[Union[Keys, str], str]
Stores information about a single Sequence (ex. chromosome, contig).
Implements the mutable mapping interface, which provides access to the attributes of this
sequence, including name, length, but not index. When using the mapping interface, for example
getting, setting, deleting, as well as iterating over keys, values, and items, the values will
always be strings (str type). For example, the length will be a str when accessing via
get; access the length directly or use len to return an int. Similarly, use the
alias property to return a List[str] of aliases, use the alternate property to return
an AlternateLocus-typed instance, and the topology property to return a Topology-typed
instance.
All attributes except name and length may be set. To change the name or length, use
dataclasses.replace to create a new copy.
Important: The len method returns the length of the sequence, not the length of the
attributes. Use len(meta.attributes) for the latter.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | the primary name of the sequence |
| length | int | the length of the sequence, or zero if unknown |
| index | int | the index in the sequence dictionary |
| attributes | Dict[Union[Keys, str], str] | attributes of this sequence |
Source code in fgpyo/fasta/sequence_dictionary.py
Attributes¶
property
A list of all names, including the primary name and aliases, in that order.
property
True if there is an alternate locus defined, False otherwise
Functions¶
Any post initialization validation should go here
Source code in fgpyo/fasta/sequence_dictionary.py
staticmethod
from_sam(meta: Dict[Union[Keys, str], Any], index: int) -> SequenceMetadata
Builds a SequenceMetadata from a dictionary. The keys must include the sequence
name (Keys.SEQUENCE_NAME) and length (Keys.SEQUENCE_LENGTH). All other keys from
Keys will be stored in the resulting attributes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
meta
|
Dict[Union[Keys, str], Any]
|
the python dictionary with keys from |
required |
index
|
int
|
the 0-based index to use for this sequence |
required |
Source code in fgpyo/fasta/sequence_dictionary.py
same_as(other: SequenceMetadata) -> bool
Returns true if the sequences share a common reference name (including aliases), have the same length, and the same MD5 if both have MD5s.
Source code in fgpyo/fasta/sequence_dictionary.py
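The comparison rule described for same_as can be written out as a small function. This is a sketch of the stated rule only, not fgpyo's code: `same_sequence` and its flat argument list are hypothetical, chosen so the example runs without the library.

```python
from typing import List, Optional


def same_sequence(
    names_a: List[str],
    length_a: int,
    md5_a: Optional[str],
    names_b: List[str],
    length_b: int,
    md5_b: Optional[str],
) -> bool:
    """Hypothetical comparison following the rule described above."""
    # At least one name (primary or alias) must be shared.
    if not set(names_a) & set(names_b):
        return False
    # Lengths must agree.
    if length_a != length_b:
        return False
    # MD5s are only compared when both sequences have one.
    if md5_a is not None and md5_b is not None:
        return md5_a == md5_b
    return True


print(same_sequence(["chr1", "1"], 101, None, ["1"], 101, "abc"))  # True
print(same_sequence(["chr1"], 101, "abc", ["chr1"], 101, "def"))   # False
```

A missing MD5 on either side is treated as "unknown" rather than as a mismatch, which is why the first call returns True.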
Converts the sequence metadata to a dictionary equivalent to one item in the
list of sequences from pysam.AlignmentHeader#to_dict()["SQ"].