

Rapid Database Search for Sequence Similarity
October 16, 2008 Doug Brutlag Homework Assignment Number 5
In this homework you will perform three searches on a family of proteins and analyze the results. Use the pyruvate dehydrogenase E1 component family (Enzyme Commission number EC 1.2.4.1) as your gold standard. You may use either the alpha subunit (45 examples in UniProt/SwissProt) or the beta subunit (also 45 examples in UniProt/SwissProt), and you may choose a protein from any species as your query. Do NOT use one of the bacterial pyruvate dehydrogenases that have only a single subunit (>800 amino acids). You can use the ExPASy SRS Search server to retrieve a list of all SwissProt sequences belonging to that family; this list is your gold standard. Choose one such sequence as a query and use the Decypher supercomputer to perform UnGAPPED Tera-BLAST, Banded SW Tera-BLAST, and standard Smith-Waterman algorithm searches of the UniProt/SwissProt database. Make sure that in each case you collect at least 100 sequences in your result set (about twice the expected number of family members). Also be sure to turn query filtering OFF (the default is ON).
Now compare the three searches using Receiver Operating Characteristic (ROC) curves. For each search, draw a line across the output list after every 10th sequence and count the number of true positives and the number of false positives above that line. Remember that the gold standard determines whether a sequence is a true positive. Continue until you have collected at least 50 false positives (ROC50 curve). Finally, plot the number of true positive sequences versus the number of false positive sequences on a two-dimensional graph for each search.
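The counting procedure above can also be checked with a short script. The following is only a minimal sketch, not a required part of the assignment: it assumes you have already copied each search's hit list, in rank order, into a Python list of sequence identifiers, and your gold standard into a set. The function name `roc_points`, its parameters, and the toy identifiers are hypothetical and chosen for illustration.

```python
def roc_points(ranked_ids, gold_standard, step=10, max_false_positives=50):
    """Count (false_positives, true_positives) after every `step`-th hit.

    ranked_ids: hit identifiers in the order the search reported them.
    gold_standard: set of identifiers that belong to the protein family.
    Counting stops at the first line where at least
    `max_false_positives` false positives have been seen.
    """
    points = []
    true_positives = 0
    false_positives = 0
    for rank, seq_id in enumerate(ranked_ids, start=1):
        if seq_id in gold_standard:
            true_positives += 1
        else:
            false_positives += 1
        # "Draw a line" after every `step`-th sequence and record the counts.
        if rank % step == 0:
            points.append((false_positives, true_positives))
            if false_positives >= max_false_positives:
                break
    return points


# Toy data (made-up identifiers), with a line after every 2nd hit
# so the result is easy to verify by hand.
gold = {"P1", "P2", "P3"}
hits = ["P1", "P2", "X1", "P3", "X2", "X3"]
points = roc_points(hits, gold, step=2, max_false_positives=2)
print(points)
```

For the assignment itself you would keep the defaults (`step=10`, `max_false_positives=50`), run the function once per search, and plot the second value of each pair (true positives) against the first (false positives).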
Do the three searches have identically shaped curves? Is one curve higher than the others? If so, which search is the best for the pyruvate dehydrogenase family?
Due October 23, 2008