We recommend the default MAS5.0 normalization steps with Entrez BrainArray Custom CDF. Final log-transformation is also recommended.
We recommend the Bowtie and TopHat alignment algorithms with NCBI's transcript reference.
We use quantile transformation in order to compute hgu133plus2-like expression values. The hgu133plus2 reference was constructed from 1000 random samples. This step is automatically taken after submission.
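The quantile transformation mentioned above can be illustrated with a minimal sketch. This is a generic rank-based mapping onto a reference distribution, not necessarily the exact procedure URSA(HD) runs; the function name `quantile_map`, the tie handling, and the linear interpolation between reference values are all assumptions made for illustration.

```python
def quantile_map(values, reference):
    """Map each value onto a reference distribution by its rank.

    Illustrative only: the smallest input value receives the smallest
    reference value, the largest receives the largest, and values in
    between are linearly interpolated along the sorted reference.
    """
    ref_sorted = sorted(reference)
    n = len(values)
    m = len(ref_sorted)
    # Indices of the input values in ascending order (ties keep input order).
    order = sorted(range(n), key=lambda i: values[i])
    mapped = [0.0] * n
    for rank, idx in enumerate(order):
        # Relative rank in [0, 1], then the matching position in the reference.
        q = rank / (n - 1) if n > 1 else 0.0
        pos = q * (m - 1)
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        mapped[idx] = ref_sorted[lo] * (1 - frac) + ref_sorted[hi] * frac
    return mapped


# Hypothetical toy data: three input values mapped onto a three-value reference.
print(quantile_map([5.0, 1.0, 3.0], [10.0, 20.0, 30.0]))
```

After the mapping, the sample keeps its own gene ordering but takes on the value distribution of the reference, which is what makes profiles from different platforms comparable.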
URSA(HD) expects a two-column text file in which the first column contains Entrez IDs (e.g. 672 or 672_at for BRCA1) or official HGNC gene symbols (e.g. BRCA1) and the second column contains the corresponding quantified expression values.
This mapping file contains all gene names with Entrez ids that we use for processing.
The files below are examples of files that can be uploaded to URSA(HD).
100009676_at 7.07449740361025
10000_at 7.53465882722509
10001_at 9.53297503541572
10002_at 6.6416331851118
10003_at 3.59384487863744
100048912_at 4.56957162092669
100049716_at 7.98030181600126
10004_at 7.99370860814693
10005_at 9.65020794108123
10006_at 11.3924969710799
...

10000_at 8.76654043343922
10001_at 8.36022155384614
10002_at 6.06562333136321
10003_at 8.04370245997055
100048912_at 9.22027663827746
10004_at 4.20938555580617
10005_at 9.70403314016829
10006_at 7.86696236405382
10007_at 7.99204011370414
10009_at 6.45513387024632
...

LOC100506869 0.000000
LOC100506865 0.000000
MTVR2 0.000000
LOC100506867 0.142549
LOC100506860 0.755204
LOC100506862 0.171215
ATRX 4.968670
LOC147670 0.344138
LOC100506866 0.132153
LOC441204 1.253760
...
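Before uploading, a file in the two-column format described above can be checked locally. The sketch below is only an illustration: `parse_profile` is a hypothetical helper, whitespace is assumed as the column delimiter, and stripping the `_at` suffix is an assumption about how the two Entrez ID spellings relate, not a description of URSA(HD)'s own parser.

```python
def parse_profile(text):
    """Parse two-column expression text into a {gene: value} dictionary.

    Raises ValueError if a non-empty line does not have exactly two
    columns or if the second column is not a number.
    """
    profile = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split()  # assumption: columns are whitespace-separated
        if len(parts) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 columns, got {len(parts)}"
            )
        gene, value = parts
        # Assumption: "672_at" and "672" refer to the same Entrez ID.
        if gene.endswith("_at"):
            gene = gene[:-3]
        profile[gene] = float(value)  # raises ValueError on non-numeric input
    return profile


# Toy input mixing an Entrez-style ID and a gene symbol (values are made up).
print(parse_profile("672_at 7.5\nBRCA1\t3.25\n"))
```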
If you need help processing your raw files, please let us know at function@genomics.princeton.edu.
Results for one user expression profile are returned at a time. To compare molecular signals between expression profiles, we provide an email with a link to all of your results which can be opened and viewed simultaneously.
For diseases that are not included in the URSAHD training set, URSAHD should in theory make "no calls": the SVM margins from each URSAHD disease model would be very small and thus uninformative for the Bayesian network, leading to posterior probabilities close to the prior. That said, we believe that most diseases are related to some extent, so in practice the wide disease coverage of the URSAHD training set could lead to the detection of related-disease signals in such a "novel" disease sample.
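The link between uninformative margins and a posterior close to the prior can be seen in a single-variable Bayes update. This is a deliberate simplification and not the URSAHD Bayesian network itself; the function name and the numbers are hypothetical, and `lik_pos` and `lik_neg` stand for how likely the observed margin is under "disease" and "not disease".

```python
def posterior(prior, lik_pos, lik_neg):
    """Bayes' rule for one binary variable and one piece of evidence."""
    numerator = prior * lik_pos
    return numerator / (numerator + (1 - prior) * lik_neg)


# Uninformative evidence: the margin is equally likely under both hypotheses,
# so the posterior stays at the prior.
print(posterior(0.1, 0.5, 0.5))

# Informative evidence: the margin is far more likely under "disease",
# so the posterior moves well away from the prior.
print(posterior(0.1, 0.9, 0.1))
```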
The area under the precision-recall curve (AUPRC) for each URSAHD disease model is available here: whole-evaluation.tsv
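For readers unfamiliar with the metric, the sketch below computes average precision, one common estimator of the AUPRC. Whether the values in the evaluation file were computed in exactly this way is not stated here, so treat this as an assumption; the function name and the toy labels and scores are illustrative.

```python
def average_precision(labels, scores):
    """Average precision: the mean of precision at each positive's rank.

    labels: 1 for samples with the disease, 0 otherwise.
    scores: model outputs, higher meaning more confident.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank samples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / k  # precision at this positive's rank
    return precision_sum / total_pos


# Toy example: positives ranked 1st and 3rd out of four samples.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```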
To make use of tissue relationships, gene expression experiments were annotated to one or more terms in the BRENDA Tissue Ontology. After an initial substring text-mining of sample descriptions in GEO, term-to-experiment pairs were manually verified against the sample descriptions and associated publication(s) to exclude incorrect or ambiguous pairs. The associated publication (original paper) was examined only when the sample descriptions were ambiguous. Sample annotations were then propagated based on the tissue ontology. Note that experiments were not necessarily annotated to their most specific term in the ontology, although we attempted to do so.
Manual tissue annotations are available here: manual_annotations_ursa.csv
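The propagation of sample annotations through the tissue ontology can be sketched as follows. This assumes the usual convention that a sample annotated to a term is also annotated to all of that term's ancestors; the function name, the child-to-parents dictionary, and the term names are hypothetical and are not actual BRENDA Tissue Ontology identifiers.

```python
def propagate(annotations, parents):
    """Add every ancestor term to each sample's set of annotations.

    annotations: {sample: set of directly annotated terms}
    parents:     {term: list of parent terms}
    """
    def ancestors(term):
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for parent in parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    propagated = {}
    for sample, terms in annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term)
        propagated[sample] = expanded
    return propagated


# Hypothetical three-level hierarchy: lymphocyte -> leukocyte -> blood.
toy_parents = {"lymphocyte": ["leukocyte"], "leukocyte": ["blood"]}
print(propagate({"sample_1": {"lymphocyte"}}, toy_parents))
```

Because annotations flow upward, a sample annotated to a broad term rather than its most specific one still contributes to that broad term and its ancestors, only not to the more specific descendants.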