GenePattern file format
The conversion is done using one of several algorithms. Samples can be annotated by specifying a CLM file. A CLM file allows you to change the names of the samples in the expression matrix, reorder the columns, select a subset of the scans in the input ZIP file, and create a class label file in the CLS format. The CLM file specifies the sample names explicitly, and the columns in the expression matrix are reordered to match the order in which the scan names appear in the CLM file.
For example, the input ZIP file contains the files scan1. The CLM file could contain the following text:
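As a rough illustration of the renaming and reordering that a CLM file drives, the following Python sketch assumes a tab-delimited CLM with three columns per line (scan name, sample name, class name). The scan names, class names, expression values, and the `parse_clm` and `apply_clm` helpers are hypothetical, chosen only to illustrate the idea; this is not the module's implementation.

```python
def parse_clm(text):
    """Parse CLM text into (scan, sample, class) rows, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            scan, sample, cls = line.split("\t")[:3]
            rows.append((scan.strip(), sample.strip(), cls.strip()))
    return rows


def apply_clm(scan_columns, clm_rows):
    """Rename and reorder expression columns to follow the CLM row order.

    scan_columns maps each scan name to its list of expression values.
    Scans that the CLM does not list are dropped (subset selection).
    """
    kept = [row for row in clm_rows if row[0] in scan_columns]
    names = [sample for _, sample, _ in kept]
    columns = {sample: scan_columns[scan] for scan, sample, _ in kept}
    classes = [cls for _, _, cls in kept]
    return names, columns, classes


# Hypothetical CLM content: scan name, sample name, class name.
clm_text = "scan3\tsample3\ttumor\nscan1\tsample1\ttumor\nscan2\tsample2\tnormal\n"
# Hypothetical expression values keyed by scan name.
scans = {"scan1": [1.0, 2.0], "scan2": [3.0, 4.0], "scan3": [5.0, 6.0]}

names, columns, classes = apply_clm(scans, parse_clm(clm_text))
print(names)
```

The `classes` list is kept in the same order as `names`, which is the information a CLS file would need.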
The column names in the expression matrix would be: sample3, sample1, sample2.

Alternatively, this field may be set to ALL, indicating that the input expression dataset is to be projected to all gene sets defined in the specified gene set database(s).
Supported methods are rank, log. The default value of 0. The module authors strongly recommend against changing from the default. Default: 0.

Gene symbols are typically listed in the column with header Name; however, GCT files containing RNAi data may list the gene symbol name in alternative columns. Typically these are human gene symbols. The CollapseDataset GenePattern module can make this transformation.

Each line in the text file contains a single filename. Typically this file is generated by the GenePattern ListFiles module and is used when projecting expression data onto gene sets defined across multiple gene set database files. No duplicate gene set names are allowed across the listed gene set database files.

Task Type: Projection.

The formats are identical other than the separation character (tab or comma).
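The requirement above that gene set names be unique across the listed gene set database files can be checked before running a projection. The sketch below assumes each database is in GMT layout (one tab-separated row per gene set, with the name in the first field). The file names, gene sets, and helper functions are hypothetical.

```python
from collections import defaultdict


def gene_set_names(gmt_text):
    """Return the gene set names in GMT text (first tab-separated field per row)."""
    return [line.split("\t")[0] for line in gmt_text.splitlines() if line.strip()]


def find_duplicate_names(databases):
    """Map each gene set name seen more than once to the files that define it.

    databases maps a file name to that file's GMT text.
    """
    seen = defaultdict(list)
    for filename, text in databases.items():
        for name in gene_set_names(text):
            seen[name].append(filename)
    return {name: files for name, files in seen.items() if len(files) > 1}


# Hypothetical file names and gene sets, for illustration only.
databases = {
    "a.gmt": "SET_1\tna\tTP53\tBRCA1\nSET_2\tna\tEGFR\n",
    "b.gmt": "SET_2\tna\tKRAS\tMYC\nSET_3\tna\tPTEN\n",
}
duplicates = find_duplicate_names(databases)
print(duplicates)  # {'SET_2': ['a.gmt', 'b.gmt']}
```

An empty result means the listed files can be used together.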
The CHIP file format is organized as follows:

The CLS file format defines phenotype (class or template) labels and associates each sample in the expression data with a label. It uses spaces or tabs to separate the fields. The CLS file format differs somewhat depending on whether you are defining categorical or continuous phenotypes:
Note: Most GenePattern modules are intended for use with categorical phenotypes. Therefore, unless the module documentation explicitly states otherwise, a CLS file should define categorical labels.

Categorical labels define discrete phenotypes (for example, normal vs. tumor). For categorical labels, the CLS file format is organized as follows:
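A minimal parsing sketch for categorical labels follows. It assumes the commonly documented three-line layout: a first line with the number of samples, the number of classes, and the constant 1; a second line starting with `#` followed by the class names; and a third line with one label per sample. That layout is an assumption here, and the `parse_categorical_cls` helper and sample content are hypothetical.

```python
def parse_categorical_cls(text):
    """Parse a categorical CLS file, assuming the three-line layout described above."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Fields may be separated by spaces or tabs, so split on any whitespace.
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    if not lines[1].startswith("#"):
        raise ValueError("second line should start with '#'")
    class_names = lines[1][1:].split()
    labels = lines[2].split()
    if len(class_names) != n_classes:
        raise ValueError("number of class names does not match the first line")
    if len(labels) != n_samples:
        raise ValueError("number of labels does not match the first line")
    return class_names, labels


# Hypothetical content: six samples, two classes.
cls_text = "6 2 1\n# normal tumor\n0 0 0 1 1 1\n"
class_names, labels = parse_categorical_cls(cls_text)
print(class_names, labels)
```

Checking both counts against the first line catches the most common hand-editing mistake, a label line that no longer matches the number of samples.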
Continuous phenotypes are used for time series experiments or to define the profile of a gene of interest (gene neighbors). A CLS file that defines continuous labels can contain one or more labels.
The following example shows a CLS file that defines two continuous labels:

For a continuous phenotype label, the values for the samples define the phenotype profile. The relative change in the values defines the relative distance between points in the phenotype profile. In the example shown above, the phenotype profile is the expression profile for a gene: the sample values for the two phenotype labels are gene expression values.
For a time series experiment, you would choose sample values that define the desired expression profile. The example shown below assumes that you have five samples taken at 30-minute intervals. The first phenotype label defines a phenotype profile that shows steadily increasing gene expression; the second defines a profile that shows an initial peak and then gradual decrease:
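The two time-series profiles just described can be sketched in Python. The layout assumed here is a `#numeric` first line, then for each label a `#name` line followed by one value per sample. The label names and the values are hypothetical; only their shape (steady increase, and early peak followed by gradual decrease) follows the description above.

```python
def format_continuous_cls(profiles):
    """Render {label: values} as continuous CLS text (assumed #numeric layout)."""
    lines = ["#numeric"]
    for label, values in profiles.items():
        lines.append("#" + label)
        lines.append(" ".join(str(v) for v in values))
    return "\n".join(lines) + "\n"


def parse_continuous_cls(text):
    """Parse continuous CLS text back into {label: [float, ...]}."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines[0] != "#numeric":
        raise ValueError("continuous CLS text should start with #numeric")
    profiles = {}
    # After the first line, label lines and value lines alternate.
    for label_line, value_line in zip(lines[1::2], lines[2::2]):
        profiles[label_line.lstrip("#")] = [float(v) for v in value_line.split()]
    return profiles


# Hypothetical values for five samples taken at 30-minute intervals.
profiles = {
    "IncreasingProfile": [1, 2, 3, 4, 5],  # steadily increasing expression
    "PeakProfile": [1, 5, 4, 3, 2],        # initial peak, then gradual decrease
}
cls_text = format_continuous_cls(profiles)
print(cls_text)
```

Because only the relative change in the values matters, any values with the same shape would describe the same profile.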
This is a tab-delimited file format that contains SNP copy numbers. It is organized as follows:

Note: Sort the SNPs by chromosome and physical position (low to high). Most GenePattern modules, as well as many external tools, require sorted data.

The International Society for Advancement of Cytometry (ISAC) provides detailed resources outlining flow cytometry data file format standards, including for updated FCS formats, as well as example data transformations. Check module documentation to see which versions of FCS files the module accepts.
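The sorting requirement for SNP files noted above is easy to get wrong when chromosome names are compared as text, since "10" then sorts before "2". The sketch below assumes each row carries a SNP identifier, a chromosome name, and a physical position. That row shape, the SNP identifiers, and the helper names are hypothetical.

```python
def chromosome_rank(chrom):
    """Order autosomes numerically, then X, then Y, then any other name."""
    c = chrom.upper()
    if c.startswith("CHR"):
        c = c[3:]
    if c.isdigit():
        return (0, int(c), "")
    special = {"X": 1, "Y": 2}
    return (1, special.get(c, 3), c)


def sort_snps(rows):
    """Sort (snp_id, chromosome, position) rows by chromosome, then position."""
    return sorted(rows, key=lambda r: (chromosome_rank(r[1]), int(r[2])))


# Hypothetical rows: SNP identifier, chromosome, physical position.
rows = [
    ("SNP_A", "X", 500),
    ("SNP_B", "10", 200),
    ("SNP_C", "2", 900),
    ("SNP_D", "2", 100),
]
sorted_rows = sort_snps(rows)
print([r[0] for r in sorted_rows])  # ['SNP_D', 'SNP_C', 'SNP_B', 'SNP_A']
```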
For more information on each of these and other Cufflinks suite file types, see the Cufflinks website. Each FPKM tracking file has the following format:

Cuffdiff calculates the expression and fragment count for each transcript, primary transcript, and gene in each replicate prior to differential expression calculations. The results are output in read group tracking files at the level of genes, isoforms, transcription start sites, and coding sequences.
The GCT file format is a tab-delimited file format that describes an expression dataset. Most modules do not allow missing expression values. The GCT file is organized as follows:

Occasionally, GCT files are organized in a transposed structure where the columns represent genes and the rows represent samples.
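A small reader makes the GCT layout concrete and gives a cheap check that a file has the expected shape. This sketch assumes a version line such as `#1.2`, a second line with the number of rows and columns, a header line with Name, Description, and the sample names, and then one row per gene. The version line and the sample content are assumptions, and `parse_gct` is a hypothetical helper.

```python
def parse_gct(text):
    """Parse GCT text into (sample names, [(name, description, values), ...])."""
    lines = text.splitlines()
    if not lines[0].startswith("#1."):
        raise ValueError("expected a version line such as #1.2")
    n_rows, n_cols = (int(x) for x in lines[1].split("\t")[:2])
    samples = lines[2].split("\t")[2:]
    rows = []
    for line in lines[3:]:
        if not line.strip():
            continue
        fields = line.split("\t")
        # An empty field is recorded as None so missing values are easy to spot.
        values = [float(v) if v.strip() else None for v in fields[2:]]
        rows.append((fields[0], fields[1], values))
    if len(samples) != n_cols or len(rows) != n_rows:
        raise ValueError("dimensions do not match the counts on line 2")
    if any(len(values) != n_cols for _, _, values in rows):
        raise ValueError("a row has the wrong number of values")
    return samples, rows


# Hypothetical two-gene, two-sample dataset with one missing value.
gct_text = (
    "#1.2\n2\t2\nName\tDescription\tS1\tS2\n"
    "G1\tfirst gene\t1.5\t2.5\n"
    "G2\tsecond gene\t3.5\t\n"
)
samples, rows = parse_gct(gct_text)
print(samples, rows)
```

A transposed file will usually fail the dimension check or show gene names where sample names are expected, which is one quick way to notice an unexpected layout.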
Check how the file is organized to ensure that the correct preprocessing is performed on it.

This is a tab-delimited file format that contains the output results of the GLAD module. The GLAD file format is organized as follows:

The GMT format is more convenient for storing larger databases of gene sets.
The GMT format contains a row for each gene set:

The GMX format contains a column for each gene set:

The number of rows and columns should agree with the number of rows and columns specified on line 2. Each row contains a name, a description, and an intensity value for each sample. Names and descriptions can contain spaces, but may not be empty. Intensity values may be missing. To specify a missing intensity value, leave the field empty. Line format: gene name (tab) gene description (tab) col 1 data (tab) col 2 data (tab) ...

The RES file format is a tab-delimited file format that describes an expression dataset.
It is organized as follows.

The first line contains a list of labels identifying the samples associated with each of the columns in the remainder of the file. Line format: Description (tab) Accession (tab) sample 1 name (tab) (tab) sample 2 name (tab) (tab) ...

The second line contains a list of sample descriptions. Currently, GSEA ignores these descriptions. Our RES file creation tool places the sample data file name and scale factors in this row. Line format: (tab) sample 1 description (tab) (tab) sample 2 description (tab) (tab) ...

The third line contains a number indicating the number of rows in the data table that is contained in the remainder of the file. Line format: number of data rows.

There is one row for each gene and two columns for each of the samples. The first two fields in the row contain the description and name for each of the genes (names and descriptions can contain spaces since fields are separated by tabs).
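The RES layout described above can be read with a short Python sketch. It assumes that the two columns per sample hold an expression value followed by a call flag, which is not stated above. The sample names, gene names, call letters, and the `parse_res` helper are hypothetical.

```python
def parse_res(text):
    """Parse RES text into (sample names, [(name, description, values, calls), ...])."""
    lines = text.splitlines()
    # Sample names occupy every other field after Description and Accession.
    samples = [f for f in lines[0].split("\t")[2::2] if f]
    n_rows = int(lines[2].strip())
    genes = []
    for line in lines[3:3 + n_rows]:
        fields = line.split("\t")
        description, name = fields[0], fields[1]
        pairs = fields[2:2 + 2 * len(samples)]
        values = [float(v) for v in pairs[0::2]]  # assumed: first column of each pair
        calls = pairs[1::2]                       # assumed: second column of each pair
        genes.append((name, description, values, calls))
    return samples, genes


# Hypothetical two-sample file; the second gene has an empty description.
res_text = (
    "Description\tAccession\tS1\t\tS2\t\n"
    "\tS1 scan\t\tS2 scan\t\n"
    "2\n"
    "first gene\tG1\t10.0\tP\t20.0\tA\n"
    "\tG2\t5.0\tM\t7.5\tP\n"
)
samples, genes = parse_res(res_text)
print(samples, genes)
```

Splitting on tabs without stripping the line first keeps an empty leading description field in place, so the remaining columns stay aligned.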
The description field is optional but the tab following it is not.

The PCL file format is a tab-delimited file format that describes an expression dataset. For more information, see Stanford PCL file format.
The TXT format is a tab-delimited file format that describes an expression dataset. The first line contains the labels Name and Description followed by the identifiers for each sample in the dataset.
NOTE: The Description column is intended to be optional, but there is currently a bug such that it is treated as required. We hope to fix this in a future release.
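As a sketch of reading the TXT layout, the following Python uses the standard `csv` module and treats the Description column as required, in line with the note above. It assumes that each remaining row holds a name, a description, and one numeric value per sample. The `parse_txt_dataset` helper, gene names, and values are hypothetical.

```python
import csv
import io


def parse_txt_dataset(text):
    """Parse TXT expression data into (sample ids, {gene name: {sample id: value}})."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    if header[:2] != ["Name", "Description"]:
        raise ValueError("first line must start with Name and Description")
    samples = header[2:]
    data = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        data[row[0]] = dict(zip(samples, (float(v) for v in row[2:])))
    return samples, data


# Hypothetical two-gene, two-sample dataset.
txt_text = (
    "Name\tDescription\tS1\tS2\n"
    "G1\tfirst gene\t1.5\t2.5\n"
    "G2\tsecond gene\t3.5\t4.5\n"
)
samples, data = parse_txt_dataset(txt_text)
print(data["G1"]["S2"])  # 2.5
```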