create_GTF_df.Rd
The `create_GTF_df` function processes a GTF (Gene Transfer Format) data frame to extract and optimize gene, transcript, and associated annotations. It handles GTF files from GENCODE, Ensembl, and NCBI, as well as custom sources. The function also resolves duplicated gene names, unifies annotations, and lets you switch the optimization steps on or off. The data must be pre-loaded with `load_annotation()`, followed by `sort_alias()`.
create_GTF_df(input, optimize = TRUE, shift = 1e+05)
`input`: A data frame containing GTF data. The GTF file must be pre-loaded as a data frame and should have at least 9 columns, with the annotation data in the 9th column.
`optimize`: Logical (default: TRUE). If `TRUE`, the function performs optimization steps, including filling missing annotations, removing redundant rows, and unifying gene names.
`shift`: Numeric (default: 100000). Determines the threshold for resolving duplicated gene names based on genomic locus proximity.
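The page does not spell out how the proximity threshold is applied, so the following is only a minimal sketch of one possible interpretation: rows sharing a gene name on the same chromosome and strand are treated as separate loci when their start positions are more than `shift` apart. The helper `split_by_locus()` and the column names `chr`, `strand`, and `start` are hypothetical and are not part of the package.

```r
# Hypothetical helper (not the package's implementation): illustrates how a
# proximity threshold such as `shift` could separate duplicated gene names.
split_by_locus <- function(df, shift = 1e5) {
  # Sort so that rows of the same gene name, chromosome and strand are adjacent
  df <- df[order(df$gene_name, df$chr, df$strand, df$start), ]
  key <- paste(df$gene_name, df$chr, df$strand)
  # Distance to the previous row within the same name/chromosome/strand group
  gap <- ave(df$start, key, FUN = function(x) c(0, diff(x)))
  # A new locus starts at the first row of each group or after a large gap
  new_locus <- !duplicated(key) | gap > shift
  # Number the loci within each gene name
  df$locus <- ave(as.integer(new_locus), df$gene_name, FUN = cumsum)
  n_loci <- ave(df$locus, df$gene_name, FUN = max)
  # Only names that occur at more than one locus receive a suffix
  df$gene_name_fixed <- ifelse(n_loci > 1,
                               paste0(df$gene_name, ".", df$locus),
                               df$gene_name)
  df
}

# Toy data, invented for illustration only
toy <- data.frame(
  gene_name = c("A", "A", "A", "B", "B"),
  chr       = c("chr1", "chr1", "chr1", "chr2", "chr2"),
  strand    = c("+", "+", "+", "-", "-"),
  start     = c(1000, 5000, 900000, 100, 2000),
  stringsAsFactors = FALSE
)
res <- split_by_locus(toy, shift = 50000)
```

With a smaller `shift`, more rows are split into separate loci; with a larger one, distant copies of a name stay merged.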
A processed data frame with standardized GTF fields, including:

- `gene_name`
- `gene_id`
- `transcript_id`
- `transcript_name`
Additional columns depend on the input data frame structure.
This function is designed to handle different formats of GTF annotations:

- **GENCODE/Ensembl**: Extracts `gene_id`, `gene_name`, `transcript_id`, and `transcript_name`.
- **NCBI**: Extracts `GeneID`, `gene_name`, and `GenBank` transcript identifiers.
- **Custom Format**: Parses annotations containing `gene:` and `transcript:` prefixes.
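To make the extraction step concrete, here is a minimal base R sketch of pulling `key "value"` pairs out of a GENCODE/Ensembl-style attribute string (the 9th GTF column). The helper `extract_attr()` and the identifiers in the toy strings are hypothetical, and the package's own parsing may differ. NCBI-style and `gene:`/`transcript:`-prefixed annotations would need different patterns.

```r
# Hypothetical helper (not part of the package): extracts the quoted value
# that follows a given key in a GTF attribute string.
extract_attr <- function(attr, key) {
  pattern <- paste0(key, ' "([^"]*)"')
  m <- regmatches(attr, regexec(pattern, attr))
  # Each element holds the full match and the captured value, or is empty
  vapply(m,
         function(x) if (length(x) >= 2) x[2] else NA_character_,
         character(1))
}

# Toy attribute strings, invented for illustration only
attr_col <- c(
  'gene_id "G1"; transcript_id "T1"; gene_name "ABC"; transcript_name "ABC-201";',
  'gene_id "G2"; gene_name "XYZ";'
)

gene_id       <- extract_attr(attr_col, "gene_id")
gene_name     <- extract_attr(attr_col, "gene_name")
transcript_id <- extract_attr(attr_col, "transcript_id")
```

Rows that lack a key, such as gene-level records without a `transcript_id`, come back as `NA` rather than raising an error.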
The function includes:

- Optimization to merge missing or inconsistent annotations across the dataset.
- Detection and repair of duplicated gene names appearing at different loci or strands.
- Parallelized processing for improved performance.
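As an illustration of what filling in missing annotations can look like, the sketch below copies a known `gene_name` to rows that share the same `gene_id` but lack a name. The helper `fill_missing_names()` is hypothetical; it is one simple way to perform such a fill, not necessarily how `create_GTF_df` does it.

```r
# Hypothetical helper (not part of the package): fills missing gene names
# from other rows that carry the same gene_id.
fill_missing_names <- function(df) {
  # Lookup table of gene_id -> first known gene_name
  known <- df[!is.na(df$gene_name), c("gene_id", "gene_name")]
  known <- known[!duplicated(known$gene_id), ]
  idx <- match(df$gene_id, known$gene_id)
  missing <- is.na(df$gene_name)
  # Rows whose gene_id has no known name anywhere stay NA
  df$gene_name[missing] <- known$gene_name[idx[missing]]
  df
}

# Toy data, invented for illustration only
ann <- data.frame(
  gene_id   = c("g1", "g1", "g2"),
  gene_name = c("A", NA, NA),
  stringsAsFactors = FALSE
)
filled <- fill_missing_names(ann)
```
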
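# Process the GTF data
# (gtf_data is assumed to be a data frame prepared beforehand with
# load_annotation() and sort_alias(); it is not created in this example)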
processed_gtf <- create_GTF_df(gtf_data, optimize = TRUE, shift = 50000)
#>
#>
#> GTF converting...
#>
#> Error: object 'gtf_data' not found