Create Refflat-Style Annotations from Genomic Data

This function generates a refflat-style data frame from genomic annotations. The refflat format represents gene structures, including transcripts and exons, for use in genomic analyses. Data must be pre-loaded using load_annotation() followed by create_GTF_df().

refflat_create(
  input,
  geneName = "gene_name",
  name = "transcript_id",
  genetic_elements = c("TRANSCRIPT", "MRNA", "CDS", "GENE")
)

Arguments

input: A data frame containing genomic data with the following columns:
  - `chr`: Chromosome identifier
  - `start`: Start position of the annotation
  - `end`: End position of the annotation
  - `strand`: Strand information ('+' or '-')
  - `gene_name`: Name of the associated gene
  - `gene_id`: Unique identifier for the gene
  - `transcript_name`: Name of the associated transcript
  - `transcript_id`: Unique identifier for the transcript
  - `annotationType`: Type of annotation (e.g., 'EXON', 'TRANSCRIPT')
geneName: A string specifying the column name representing gene names (default: 'gene_name').
name: A string specifying the column whose values populate the `name` field of the output (default: 'transcript_id').
genetic_elements: An optional character vector of genetic element types (e.g., 'CDS', 'GENE', 'MRNA') to include when creating transcripts for a RefFlat file (default: c("TRANSCRIPT", "MRNA", "CDS", "GENE")).
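
As an illustration of the expected `input` layout, the sketch below builds a small data frame with the nine columns listed above and checks that none are missing. The values (`GENE_A`, `T0001`, the coordinates) and the object names `toy_input` and `required_cols` are hypothetical; in practice the data frame would come from load_annotation() followed by create_GTF_df().

```r
# Hypothetical toy input: one transcript with two exons on chr1.
# All values are made up for illustration only.
toy_input <- data.frame(
  chr             = "chr1",
  start           = c(100L, 100L, 300L),
  end             = c(500L, 200L, 500L),
  strand          = "+",
  gene_name       = "GENE_A",
  gene_id         = "G0001",
  transcript_name = "GENE_A-201",
  transcript_id   = "T0001",
  annotationType  = c("TRANSCRIPT", "EXON", "EXON"),
  stringsAsFactors = FALSE
)

# Columns the documentation lists for `input`.
required_cols <- c(
  "chr", "start", "end", "strand", "gene_name", "gene_id",
  "transcript_name", "transcript_id", "annotationType"
)

# Any documented column absent from the data frame shows up here.
missing_cols <- setdiff(required_cols, names(toy_input))
if (length(missing_cols) > 0) {
  stop("input is missing columns: ", paste(missing_cols, collapse = ", "))
}
```

A check of this kind, run before calling refflat_create(), can surface a missing or misnamed column early.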

Value

A data frame in refflat format with the following columns:
  - `geneName`: Gene name (taken from the column given by the `geneName` parameter)
  - `name`: Identifier (taken from the column given by the `name` parameter)
  - `chrom`: Chromosome identifier
  - `strand`: Strand information
  - `txStart`: Start position of the transcript
  - `txEnd`: End position of the transcript
  - `cdsStart`: Start position of the coding sequence (CDS)
  - `cdsEnd`: End position of the CDS
  - `exonCount`: Number of exons in the transcript
  - `exonStarts`: Comma-separated list of exon start positions
  - `exonEnds`: Comma-separated list of exon end positions
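
Because `exonStarts` and `exonEnds` are stored as comma-separated strings, downstream code usually needs to split them back into numeric vectors. The following is a minimal base R sketch on a hand-written row; the object `refflat_row` and its values are hypothetical and only mimic the column layout described above.

```r
# Hypothetical single row in the documented refflat layout.
refflat_row <- data.frame(
  geneName   = "GENE_A",
  name       = "T0001",
  chrom      = "chr1",
  strand     = "+",
  txStart    = 100L,
  txEnd      = 500L,
  cdsStart   = 150L,
  cdsEnd     = 450L,
  exonCount  = 2L,
  exonStarts = "100,300",
  exonEnds   = "200,500",
  stringsAsFactors = FALSE
)

# Split the comma-separated strings into integer vectors.
exon_starts <- as.integer(strsplit(refflat_row$exonStarts, ",")[[1]])
exon_ends   <- as.integer(strsplit(refflat_row$exonEnds, ",")[[1]])

# Sanity check: both lists should agree with exonCount.
stopifnot(
  length(exon_starts) == refflat_row$exonCount,
  length(exon_ends) == refflat_row$exonCount
)
```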

Details

The function processes input genomic data in parallel using multiple CPU cores. It filters and processes annotations for transcripts and exons, ensuring that overlapping regions are resolved. Transcripts and exons are matched, and the resulting structure is formatted in the refflat style.
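
The matching of exons to transcripts can be pictured with a simplified base R sketch. This is not the package's implementation: it runs sequentially with `lapply`, omits `cdsStart`/`cdsEnd`, and does not resolve overlapping regions. The helper `collapse_transcript` and the toy `exons` table are hypothetical.

```r
# Hypothetical exon-level records for two transcripts.
exons <- data.frame(
  chr            = "chr1",
  start          = c(300L, 100L, 700L),
  end            = c(500L, 200L, 900L),
  strand         = "+",
  gene_name      = c("GENE_A", "GENE_A", "GENE_B"),
  transcript_id  = c("T0001", "T0001", "T0002"),
  annotationType = "EXON",
  stringsAsFactors = FALSE
)

# Collapse all exons of one transcript into a single refflat-like row.
collapse_transcript <- function(tx) {
  tx <- tx[order(tx$start), ]  # exons in ascending coordinate order
  data.frame(
    geneName   = tx$gene_name[1],
    name       = tx$transcript_id[1],
    chrom      = tx$chr[1],
    strand     = tx$strand[1],
    txStart    = min(tx$start),
    txEnd      = max(tx$end),
    exonCount  = nrow(tx),
    exonStarts = paste(tx$start, collapse = ","),
    exonEnds   = paste(tx$end, collapse = ","),
    stringsAsFactors = FALSE
  )
}

# One group per transcript, then one output row per group.
per_transcript <- split(exons, exons$transcript_id)
sketch <- do.call(rbind, lapply(per_transcript, collapse_transcript))
rownames(sketch) <- NULL
```

Since each transcript is processed independently, the `lapply` call is the natural place to parallelize, for example with `parallel::mclapply` on platforms that support forking.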

Examples

# Run the function
refflat_data <- refflat_create(input, geneName = "gene_name", name = "gene_id", genetic_elements = c("CDS", "GENE", "MRNA"))
#> Error in refflat_create(input, geneName = "gene_name", name = "gene_id",     genetic_elements = c("CDS", "GENE", "MRNA")): unused argument (genetic_elements = c("CDS", "GENE", "MRNA"))