MSc-PhD Studentecho $email_mailto ?> Phone +82 2 2220 2397
Mr. Schmidt is a PhD candidate, who started working as a research student in spring 2015 and joined the combined Master-PhD program in 2017. He is expected to graduate in February 2021. His research interests are within the topics of algorithms, read-alignments, structural variant (SV) calling and clustering.
He is currently working on a SV caller that uses the insights & infrastructure created in the “A performant bridge between fixed-size and variable-size seeding” and “Accurate high throughput alignment via line sweep-based seed processing” publications.
A performant bridge between fixed-size and variable-size seeding 
Seeding is commonly used as the first stage in read-alignments. This work analyzes several seeding strategies and proposes an approach for computing variable-size seeds from fixed-size seeds. A major outcome is a hierarchy of seed-sets, where fixed-size and variable-size seed sets are compared theoretically and practically. The theoretical comparison shows equivalences between several fixed-size and variable-size seed sets as well as subset relations within the two groups. The practical analysis compares, among others, the time required for computing the seed sets and their rates of “correct information”.
Kutzner, A., Kim, PS. & Schmidt, M. A performant bridge between fixed-size and variable-size seeding. BMC Bioinformatics 21, 328 (2020). https://doi.org/10.1186/s12859-020-03642-y
Accurate high throughput alignment via line sweep-based seed processing (MA – The Modular Aligner) 
Many aligners use the following three-stage paradigm (also called seed-chain-extend paradigm):
- Seeding. Computation of perfect matches (or with few mismatches) using some form of indexing.
- Coupling. Identification of promising seed subsets that could lead to an optimal alignment. A popular technique for this task is chaining.
- Dynamic Programming. Closing of remaining gaps between seeds and extension of the outer end points given by the seeds.
MA stays with this basic pattern, but relies on novel algorithmic solutions for the stages 1 and 2. For stage 1, MA introduces a divide and conquer approach that generates seeds with a higher rate of information compared to other variable-size seed sets as e.g. MEMs and SMEMs. In stage 2 we come up with two new ideas. First, the Strip of Consideration (SoC) and, second, seed harmonization. Both techniques are simple line-sweeps. The SoC is used to identify promising regions on the reference, while the seed harmonization purges contradicting seeds. No specially tailored data structures are required for either approach. Both approaches are very efficient and easy to implement (roughly 50 lines Pseudocode).
Schmidt, M., Heese, K. & Kutzner, A. Accurate high throughput alignment via line sweep-based seed processing. Nature Communications 10, 1939, doi:10.1038/s41467-019-09977-2 (2019)
A novel specialized single linkage clustering algorithm for taxonomy analyses 
Mr. Schmidt, Prof. Kutzner and Prof. Heese developed a novel algorithm for taxonomic analyses. Mr. Schmidt implemented the novel algorithm as well as a plain single linkage approach. Source files as well as the executable jar file of the reference implementation can be obtained here. The application does not perform alignment, it merely loads Klee-diagrams. Some example diagrams can be downloaded here. The files are given in Excel-xml format, so it is possible to inspect the raw data using Microsoft Excel. The algorithm is intended to enable big-data analyses of orthologue genes, by breaking the time complexity of general single linkage clustering. This is achieved by exploiting the taxonomic order of species.
Schmidt M, Kutzner A, Heese K. A novel specialized single-linkage clustering algorithm for taxonomically ordered data. J Theor Biol. 2017 Aug 1;427:1-7. doi: 10.1016/j.jtbi.2017.05.008. Epub 2017 May 15. PMID: 28522359.