MaizeSequence.org FTP Site
==========================

This site provides access to the latest sequenced maize data. The site is
part of the NSF-funded Maize Genome Sequencing Project.

The directories are named using the 'YYYYMMDD' date format, and their
content includes sequence data as of the indicated date. The 'current'
directory points to the most current data directory.

+----------------------------------------------+
|                Release 3b.50                 |
+--------------------------+-------------------+
| Freeze Date              | October 31, 2008  |
| Maize BAC clones         | 16,587            |
| Maize BAC contigs        | 185,231           |
| Contigs per BAC          | 11.2              |
| Sequence Length          | 2,778,853,373 bp  |
+--------------------------+-------------------+
| Evidence-based genes     | 90,829            |
| Fgenesh genes (unmasked) | 514,467           |
|   Protein-coding genes   | 59,052            |
|   Hypothetical genes     | 86,201            |
|   Transposon-like genes  | 369,214           |
| Fgenesh genes (masked)   | 81,453            |
|   Protein-coding genes   | 54,606            |
|   Hypothetical genes     | 18,352            |
|   Transposon-like genes  | 8,495             |
+--------------------------+-------------------+
| Working Gene Set*        | 113,672           |
|   Evidence-based genes   | 90,829            |
|   Fgenesh models         | 22,843            |
|   Transcripts            | 148,278           |
| Filtered Gene Set        | 45,046            |
|   Evidence-based genes   | 40,354            |
|   Fgenesh models         | 4,692             |
|   Transcripts            | 75,388            |
+--------------------------+-------------------+

* - The Working Gene Set is defined as the entire set of evidence-based
    genes (predicted by Gramene GeneBuilder), complemented by those
    Fgenesh models, predicted on masked DNA sequence, that do not overlap
    the loci of the evidence-based genes. The Filtered Gene Set was
    generated by screening the Working Gene Set to remove pseudogenes,
    TE-encoded genes, and low-confidence hypothetical models.

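The Working Gene Set definition above can be illustrated with a minimal
Python sketch. It is not the project's pipeline: the Locus type, the
function names, and the overlap rule (same clone, intersecting coordinate
ranges, strand ignored) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Locus:
    """Hypothetical minimal gene record: an ID plus its location on a clone."""
    gene_id: str
    clone: str
    start: int
    end: int


def overlaps(a: Locus, b: Locus) -> bool:
    # Assumption: two loci overlap when they lie on the same clone and
    # their inclusive coordinate ranges intersect; strand is ignored.
    return a.clone == b.clone and a.start <= b.end and b.start <= a.end


def build_working_set(evidence: List[Locus], fgenesh_masked: List[Locus]) -> List[Locus]:
    # Keep every evidence-based gene, then add only those masked Fgenesh
    # models that overlap no evidence-based locus.
    kept = [m for m in fgenesh_masked
            if not any(overlaps(m, e) for e in evidence)]
    return list(evidence) + kept
```

With one evidence-based gene at AC1:100-200, a model at AC1:150-250 would
be dropped, while models at AC1:300-400 or on another clone would be kept.
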
==== CONTENTS ====

All sequence files are compressed as bzip2 archives and are available in:

  sequences/

Clone Sequences
---------------

The following raw DNA sequences are available for the entire maize genome:

  bacs.fasta.bz2                Accessioned BACs, as retrieved from GenBank
  bacs_masked.fasta.bz2         Masked accessioned BACs
  bac_contigs.fasta.bz2         Individual intra-BAC contigs
  bac_contigs_masked.fasta.bz2  Masked intra-BAC contigs

The format of each sequence comment within the raw DNA files is:

  >SEQUENCE_ID COORD_SYSTEM:VERSION:SEQUENCE_ID:START:END:STRAND

The SEQUENCE_ID is the unique accessioned ID of the sequence. In the case
of clones, it is the precise GenBank versioned accession; in the case of
contigs, it is the accession with a contig number suffix, as indicated in
the GenBank record. The rest of the comment is sequence metadata that is
used internally. Of note, the COORD_SYSTEM indicates whether the sequence
is a clone or a contig, and the END also corresponds to the length of the
sequence (since START is always 1).

Gene Sequences
--------------

Various classes of genes are provided as sequence dumps.
For each type, the following dumps are available:

  [type]_genes.fasta.bz2         The genomic sequences of the genes
  [type]_cds.fasta.bz2           The coding regions defined for gene
                                 transcripts
  [type]_pre_mrna.fasta.bz2      Annotated transcript structure within the
                                 pre-spliced genes; exons are in uppercase;
                                 introns are in lowercase
  [type]_translations.fasta.bz2  The translated peptide products of the
                                 gene transcripts
  [type]_exons.fasta.bz2         The gene exons

The following types of gene classes are available:

  fgenesh_masked  Fgenesh models predicted on masked sequence
  evidence        Gramene GeneBuilder genes, based on supporting evidence
  working_set     Working Gene Set (* see above)
  filtered_set    Filtered Gene Set (* see above)

The format of each sequence comment within the gene files is:

  >OBJECT_ID CLONE:START:END:STRAND:ANALYSIS:CLASSIFICATION

  OBJECT_ID       The unique ID of the gene, transcript, translation,
                  or exon
  CLONE, START, END, STRAND
                  Locus information about the specific object
  ANALYSIS        The prediction method used to annotate this object
  CLASSIFICATION  The assigned class based on homology to peptides in
                  the NR database:

    protein_coding              Significant homology to a known non-TE
                                protein
    transposon_pseudogene       Significant homology to a known TE protein
    protein_coding_unsupported  No significant homology, i.e., hypothetical

Reports
-------

  fpc_report.txt                  - A table describing which clones on the
                                    agarose FPC map have been accessioned
                                    (sequence present)
  evidence_genes_fpc_mappings.txt - A table showing the FPC
                                    location/annotation of BACs on which
                                    evidence-based genes were called
  protein_coding_fpc_mappings.txt - A table showing the FPC
                                    location/annotation of BACs on which
                                    protein-coding genes were called

Other Files
-----------

  gff/ - GFF3 dumps of BAC features, as Bzip2 archives

    gene_models.gff.bz2       - Combined ab initio and evidence-based
                                genes, with underlying gene structure
    working_gene_set.gff.bz2  - The Working Gene Set (* see above)
    repeat_features.gff.bz2   - MIPS/REcat features, annotated with
                                RepeatMasker
    cereal_alignments.gff.bz2 - BLAT alignments of same- and cross-species
                                libraries

  mysql/ - SQL dumps of the Ensembl maize databases used by the browser,
           as Bzip2 archives

    zea_mays_core_50_bac_3b.sql.bz2
        - BAC sequences and underlying annotations
    zea_mays_core_50_fpc_3b.sql.bz2
        - The maize agarose FPC map
    zea_mays_core_50_bac_external_3b.sql.bz2
        - Maize BAC sequences produced by prior projects

  software/ - Software related to the maize project

    maize-ensembl.tar.gz - The source code of the Maize Ensembl plugin

==== CHANGES ====

3b.50 -

  * Fgenesh models previously computed on unmasked maize sequence have
    been removed from the FTP site. The strategy has shifted to calling
    ab initio models on sequences masked with MIPS repeats, as this
    produces a more reliable set of genes.

  * Due to space constraints, all FASTA sequence files are now available
    only as compressed bzip2 (http://www.bzip2.org) archives.

  * All gene FASTA sequences are now supplied with new CDS sequences.

  * All *_transcripts files were renamed to *_pre_mrna to reflect their
    content more accurately. These files contain the gene genomic
    sequences, where letter case indicates the structure: exons are in
    uppercase, while introns are in lowercase.

  * New gene sequence dumps: filtered_set, fgenesh_masked

  * GFF dumps available for gene types: gene_models, working_set,
    filtered_set, fgenesh_masked

NOTE ABOUT GENE TRANSLATIONS: In the case of Fgenesh predictions (TE-like,
protein-coding, and hypothetical), a small number of genes were predicted
with stop codons. This is an artifact of Fgenesh predicting short genic
fragments, often with singleton exons. We do not include such translations
in the file dumps.

Our website is located at:

  http://maizesequence.org

For more information, please contact us at:

  info@maizesequence.org

----
Last updated 2009-02-22
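
As an illustration of the gene sequence comment format described under
CONTENTS (>OBJECT_ID CLONE:START:END:STRAND:ANALYSIS:CLASSIFICATION), the
following is a minimal Python sketch that reads the comment lines from a
bzip2-compressed FASTA dump. The names GeneHeader, parse_gene_header, and
iter_gene_headers are hypothetical. The sketch assumes that no field
contains a colon and that STRAND is best kept as text, since its exact
encoding is not stated here.

```python
import bz2
from typing import Iterator, NamedTuple


class GeneHeader(NamedTuple):
    """Fields of one gene-file comment line (hypothetical container)."""
    object_id: str
    clone: str
    start: int
    end: int
    strand: str          # kept as text; encoding is an assumption
    analysis: str
    classification: str


def parse_gene_header(line: str) -> GeneHeader:
    # Split ">OBJECT_ID META" on the first run of whitespace, then split
    # META on ":" into its six documented fields.
    object_id, meta = line.strip().lstrip(">").split(None, 1)
    clone, start, end, strand, analysis, classification = meta.split(":")
    return GeneHeader(object_id, clone, int(start), int(end),
                      strand, analysis, classification)


def iter_gene_headers(path: str) -> Iterator[GeneHeader]:
    # Stream the archive as text and yield only the comment lines.
    with bz2.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                yield parse_gene_header(line)
```

For example, iter_gene_headers("filtered_set_genes.fasta.bz2") could be
used to count genes per CLASSIFICATION without decompressing the file to
disk first.
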