Melanie Bahlo-Projects

Projects

In-silico gene prioritisation using brain specific gene expression data

Following on from our published research into epileptic encephalopathy genes, we have been applying data cleaning methods to several large, complex, brain specific gene expression data sets. Removal of artifacts from these precious data sets will allow us to use the data to prioritise discovered variants in gene discovery projects. These genes can be taken forward in collaborative studies for further examination.

Team members: Dr Saskia Freytag, Vesna Lukic, Karen Oliver

Reference: Oliver KL, Lukic V, Thorne NP, Scheffer I, Berkovic S, Bahlo M. Harnessing gene expression networks to prioritize candidate epileptic encephalopathy genes. PLoS One. 2014 Jul 9;9(7):e102079. PMID: 25014031

Comparison of different microarray data cleaning methods in the context of gene-gene correlations using simulated data. The simulated data consisted of 1000 arrays with 3000 measured gene expressions obscured by moderate noise. In each panel the correlations above the diagonal represent the true underlying correlations between the genes, whereas the correlations below the diagonal represent the correlations estimated from the data treated with a particular cleaning procedure. The first panel shows correlations estimated from the untreated data; the second panel shows the effect of the removal of unwanted variation procedure using 2000 negative control genes. The last two panels focus on commonly applied methods: background correction, and background correction plus quantile normalisation. In each panel the first six genes are strongly expressed, the second set of six genes have low expression, and the last six genes are not expressed and are thus uncorrelated. It is apparent that removal of unwanted variation is the only procedure able to recover the true gene-gene correlations in noisy data.
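The idea behind using negative control genes can be illustrated with a small simulation. The sketch below is a simplified, single-factor stand-in and not the procedure used to produce the figure: it assumes one unwanted factor shared by all arrays, estimates it as the per-array mean of the negative control genes, and regresses it out of two genes that are truly uncorrelated. All names and parameter values are hypothetical.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def regress_out(y, w):
    """Residuals of y after least-squares regression on w (with intercept)."""
    n = len(y)
    mw, my = sum(w) / n, sum(y) / n
    slope = (sum((a - mw) * (b - my) for a, b in zip(w, y))
             / sum((a - mw) ** 2 for a in w))
    return [b - my - slope * (a - mw) for a, b in zip(w, y)]

rng = random.Random(1)
N_ARRAYS, N_CONTROLS = 200, 100  # assumed toy sizes, far smaller than the figure

# One unwanted factor per array (for example a batch effect).
unwanted = [rng.gauss(0, 1) for _ in range(N_ARRAYS)]

# Negative control genes carry only the unwanted factor plus noise.
controls = [[u + rng.gauss(0, 1) for u in unwanted] for _ in range(N_CONTROLS)]

# Two genes with no true correlation, both contaminated by the unwanted factor.
gene_a = [2 * u + rng.gauss(0, 1) for u in unwanted]
gene_b = [2 * u + rng.gauss(0, 1) for u in unwanted]

# Estimate the unwanted factor as the per-array mean of the control genes.
factor = [sum(column) / N_CONTROLS for column in zip(*controls)]

raw_corr = pearson(gene_a, gene_b)
clean_corr = pearson(regress_out(gene_a, factor), regress_out(gene_b, factor))
print(f"correlation before cleaning: {raw_corr:.2f}")
print(f"correlation after cleaning:  {clean_corr:.2f}")
```

In this toy setting the contaminated genes look strongly correlated until the estimated factor is removed, which mirrors the qualitative message of the figure.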

 

Analysis methods for cell-free DNA for the detection of foetal anomalies and transplant rejection

Following on from our published research into plasma DNA sequencing we have been developing and applying refined methods for the description and correction of read coverage bias in next-generation sequencing data. These corrections lead to more sensitive detection of changes in cell-free DNA profiles, allowing the detection of foetal genomic abnormalities and potentially allowing the detection of transplant rejection, based on DNA extracted from blood samples.

This plot depicts the result of a cross-correlation analysis on the sequencing reads of two cell-free DNA samples. This pattern is evidence that DNA fragments occur in regularly placed clusters along the genome. The interval length of ~190bp between the correlation peaks corresponds to the distance between nucleosomes. 
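A cross-correlation of this kind can be sketched on simulated data. The example below is a toy illustration, not the analysis behind the plot: it assumes read start positions cluster around sites spaced 190 bp apart, builds a start-count track for two simulated samples, and looks for the lag with the strongest correlation. The function names, region length and jitter are hypothetical.

```python
import random

def start_track(starts, length):
    """Count read start positions at each base of a region."""
    track = [0] * length
    for s in starts:
        track[s] += 1
    return track

def cross_correlation(x, y, max_lag):
    """Normalised cross-correlation of two equal-length tracks at lags 0..max_lag."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((v - mx) ** 2 for v in x) ** 0.5
    sy = sum((v - my) ** 2 for v in y) ** 0.5
    result = []
    for lag in range(max_lag + 1):
        num = sum((x[i] - mx) * (y[i + lag] - my) for i in range(n - lag))
        result.append(num / (sx * sy))
    return result

def simulate_starts(rng, length, period, n_reads, jitter):
    """Toy sample: read starts scattered around centres spaced `period` apart."""
    centres = list(range(period, length - period, period))
    return [rng.choice(centres) + rng.randint(-jitter, jitter)
            for _ in range(n_reads)]

rng = random.Random(0)
LENGTH, PERIOD = 5000, 190  # assumed region length and cluster spacing

sample_a = start_track(simulate_starts(rng, LENGTH, PERIOD, 4000, 15), LENGTH)
sample_b = start_track(simulate_starts(rng, LENGTH, PERIOD, 4000, 15), LENGTH)

cc = cross_correlation(sample_a, sample_b, 400)

# Strongest peak away from lag 0; it should sit close to the simulated spacing.
peak_lag = max(range(100, 300), key=lambda lag: cc[lag])
print("strongest non-zero lag:", peak_lag)
```

Because both samples are drawn from the same regularly spaced clusters, the correlation rises again at lags near multiples of the spacing, which is the pattern the plot describes.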

 

Team members: Dineika Chandrananda, Peter Diakumis

Reference: Chandrananda D, Thorne NP, Ganesamoorthy D, Bruno DL, Benjamini Y, Speed TP, Slater HR, Bahlo M. Investigating and correcting plasma DNA sequencing coverage bias to enhance aneuploidy discovery. PLoS One. 2014 Jan 29;9(1):e86993. PMID: 24489824

Discovery of expanded repeats with whole genome sequencing data

We know of over twenty neurological diseases caused by expansions of short repetitive runs of DNA. These include Huntington’s disease and several forms of ataxia. It is very likely that other neurological diseases are also caused by repeat expansions; however, the repetitive nature of these loci makes expansions very difficult to discover, and new methods are required to identify them. We are developing tools to scan through databases of known short repeat loci to identify individuals that show evidence of expansions and will apply these to cohorts of unsolved patients with genetic disorders.

Team member: Rick Tankard

Identification of identity by descent relationships with DNA data

Identity by descent (IBD) describes the genetic relationship between individuals who have inherited segments of DNA from a common ancestor. IBD can be used to infer hidden relationships. These relationships, when found, form the basis of many of our disease variant discovery methods. We have implemented and extended these methods to apply to X chromosome and next-generation sequencing data. This has led to the discovery of disease-causing variants in intellectual disability, and we are extending these methods for cohort IBD discovery as well as inherited copy number variant discovery.

Team members: Lyndal Henden, Dr Thomas Scerri

Detection of an IBD tract on the X chromosome in a pair of supposedly unrelated individuals, determined using posterior probability (dots) and the Viterbi algorithm (solid line). The dotted line in the graph shows the location of a gene in which a novel, identical mutation was found in the pair; this mutation is suspected to be causal for their X-linked intellectual disability.
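The Viterbi step can be illustrated with a minimal two-state hidden Markov model. The sketch below is an assumption-laden toy and not the lab's implementation: it treats each marker on a pair of haploid X chromosomes as either matching (1) or mismatching (0), and uses fixed, hypothetical values for the chance of matching when not IBD, the genotyping error rate, and the probability of switching state between adjacent markers.

```python
import math

def viterbi_ibd(matches, p_match_nonibd=0.6, error=0.01,
                switch=0.05, prior_ibd=0.1):
    """Most likely IBD / nonIBD state at each marker (all parameters are toy values)."""
    states = ("nonIBD", "IBD")
    start = {"nonIBD": 1 - prior_ibd, "IBD": prior_ibd}
    # Emission probabilities of observing a mismatch (0) or a match (1).
    emit = {
        "nonIBD": {0: 1 - p_match_nonibd, 1: p_match_nonibd},
        "IBD": {0: error, 1: 1 - error},
    }
    trans = {a: {b: (1 - switch) if a == b else switch for b in states}
             for a in states}

    # Log-probability of the best path ending in each state at the first marker.
    score = {s: math.log(start[s]) + math.log(emit[s][matches[0]]) for s in states}
    back = []
    for obs in matches[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: score[r] + math.log(trans[r][s]))
            new_score[s] = (score[best_prev] + math.log(trans[best_prev][s])
                            + math.log(emit[s][obs]))
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Toy data: noisy flanks around a long run of matching markers.
matches = [1, 0, 1, 0, 0, 1] + [1] * 30 + [0, 1, 0, 0, 1, 0]
path = viterbi_ibd(matches)
tract = [i for i, s in enumerate(path) if s == "IBD"]
print("IBD markers:", tract[0], "to", tract[-1])
```

The posterior probabilities shown as dots in the figure would come from a forward-backward pass over the same model; only the Viterbi path is sketched here.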

 

Identifying genes for rare Mendelian forms of epilepsy

It is now well recognised that genetic factors play an important role in epilepsies. The list of causative variants for rare Mendelian forms of epilepsy has been growing rapidly since massively parallel sequencing became available. However, the causative genes in most epilepsies remain elusive.

By utilising familial and population information, we aim to identify new epilepsy genes. We also aim to investigate noncoding and structural variants in epilepsies, with a view to increasing the diagnosis rate.

Reference: Liu YC, Lee JW, Bellows ST, Damiano JA, Mullen SA, Berkovic SF, Bahlo M, Scheffer IE, Hildebrand MS; Clinical Group. Evaluation of non-coding variation in GLUT1 deficiency. Dev Med Child Neurol. 2016 Dec;58(12):1295-1302. PMID: 27265003 

Team members: Mark Bennett, Yu-Chi Liu, Karen Oliver

Building software to help discover genes involved in central nervous system disorders

Advancing our knowledge of the molecular mechanisms causing central nervous system disorders enhances medical treatment and helps in the search for cures. However, costly studies are typically needed to identify molecular mechanisms, such as genes, involved in these disorders and success is never guaranteed.

Studies are more likely to succeed, and therefore cost less, when the normal function of the part of the central nervous system under study is well understood.

To facilitate better molecular understanding, we develop user-friendly and interactive software that allows the interrogation of genes and their relationships in the central nervous system. For example, clinicians and biologists use our software brain-coX to examine relationships between genes. This can be used to assess how likely a gene of interest is to be involved in a central nervous system disorder.

We are currently extending this work to allow the interrogation of genes and their relationships in all cell types of the central nervous system. 

Team members:  Saskia Freytag, Karen Oliver, Santiago Martinez 

Reference: Freytag S, Burgess R, Oliver KL, Bahlo M. brain-coX: investigating and visualising gene co-expression in seven human brain transcriptomic datasets. Genome Med. 2017 Jun 8;9(1):55. doi: 10.1186/s13073-017-0444-y PMID: 28595657

Genetics of speech disorders

One in five Australian children starts school with a speech or language disorder. While some children will grow out of it, many others will go on to have persistent speech difficulties. Such disorders can have a profound effect on an individual’s social and mental wellbeing.

Speech disorders are thought to be caused by a combination of genetic, neurological and environmental factors. Understanding more about the genetic causes of speech disorders may improve developments in treatment and help us to identify individuals most at risk of these disorders. 

In this project, we will be taking several approaches to investigate the genetics underlying speech problems. Through whole exome and whole genome sequencing (WES/WGS) of families, we are seeking to identify causal variants responsible for rare forms of familial speech disorders. We are working to assemble an Australian stuttering cohort, with which we shall undertake a genome-wide association study (GWAS) to identify common genetic variation influencing the risk of stuttering in the general population. We also intend to develop methods for incorporating evolutionary information to identify genomic regions that are relevant to speech-related traits.

Team member: Victoria Jackson 

Reference: Liégeois FJ, Hildebrand MS, Bonthrone A, Turner SJ, Scheffer IE, Bahlo M, Connelly A, Morgan AT. Early neuroimaging markers of FOXP2 intragenic deletion. Sci Rep. 2016;6:35192. http://doi.org/10.1038/srep35192 PMID: 27734906

Genetics of autism spectrum disorders

Autism spectrum disorder (ASD) is a neurodevelopmental disorder evident from early childhood and is characterised by impairment in social communication and interactions, repetitive behaviour or speech, rigid routines and restricted interests. ASD affects up to 1 in 68 individuals and is more often diagnosed in boys compared to girls.

More than 300,000 Australians are estimated to be living with ASD, which places a large emotional burden on affected families and is a major economic cost to society. ASD presents as a broad spectrum, ranging from severe cognitive or social impairment to less severe impairment with normal intellect, referred to as high functioning ASD (HFA).

While there is no single cause of ASD, strong heritability identified in twin and family studies suggests a predominant genetic basis for the condition. While more severe cases of ASD are often associated with single de novo gene mutations, the genetic bases of HFA tend to involve more complex inheritance patterns. Indeed, as many as 70% of ASD diagnoses have unknown aetiology. This presents a major challenge to the identification of potential ASD susceptibility genes.

As part of the Collaborative AuTism Study (CATS), we have utilised extensive clinical phenotyping to identify families with high incidence of HFA, as well as individuals who are classified as having a ‘broader autism phenotype’ (BAP), which is characterised by milder ASD traits. Characterising individuals with BAP and including them as markers for genes implicated in ASD in large families improves the likelihood of gene discovery. We will combine this approach with whole genome sequencing (WGS), with a view to identifying novel causal variants involved in the development of ASD.

Team member: Dr Haloom Rafehi

Understanding the systems biology of neurogenetic diseases using machine learning

As analytical technologies improve and become more widely used, ever-increasing amounts of data are being produced from family and cohort studies of various neurogenetic diseases including epilepsy and schizophrenia.

Many of these diseases do not have clear individual causes, so we routinely capture information about broad swathes of biological processes by gathering data such as genetic sequence, chromatin accessibility, epigenetic modifications, and expression levels. Because these data are constantly expanding and changing, their sheer quantity and complexity require us to develop new approaches to their management and interrogation.

We are using cutting-edge techniques from the machine learning and artificial intelligence fields and the Institute’s new high-performance specialist computing facility to gain insight into the causes of these complex neurological diseases.

We are building semi-autonomous machine learning systems capable of coping with the volume of data produced daily in laboratories across the world, and keeping themselves up to date in this constantly shifting landscape of information. These models of disease will allow us to refine our understanding of complex neurogenetic diseases in a way that maximises our use of the ever-expanding data. 

Team member:  Dr Liam Fearnley

Discovering the role of genetics behind macular telangiectasia type 2

Macular telangiectasia type 2 (MacTel) is an often misdiagnosed degenerative eye disease that may result in blindness. There is currently no available cure for MacTel.

The disease is thought to have a strong genetic contribution, but this is likely to be complex. Through a genome-wide association study (GWAS) we recently discovered the first five genetic loci involved in this disease, some of which belong to the glycine/serine metabolism pathway.

In this project we will collect and co-analyse diverse ‘omics data. So far our team has cross-analysed SNP chip, whole DNA sequencing, metabolomics, gene expression and phenotypic data on MacTel. Our aim is to discover the role of the identified genetic loci in MacTel and its progression, as well as to discover the genetic and metabolomic mechanisms that affect the disease. Our results will hopefully lead to better prognosis and possibly future treatments.

Team members:  Roberto Bonelli, Brendan Ansell 

Reference:  Scerri TS, Quaglieri A, Cai C, Zernant J, Matsunami N, Baird L, Scheppke L, Bonelli R, Yannuzzi LA, Friedlander M; MacTel Project Consortium, Egan CA, Fruttiger M, Leppert M, Allikmets R, Bahlo M. Genome-wide analyses identify common variants associated with macular telangiectasia type 2. Nat Genet. 2017 Apr;49(4):559-567. Epub 2017 Feb 27. PMID: 28250457

Image: Network of metabolomics connection and disease relevance.  

In this image more than 800 metabolites and their connections with each other are displayed. Connections are displayed as blue lines (positive connection) or red lines (negative connection). We performed a stratified factorial analysis to explore and create potential clusters of metabolites whose joint effects might increase the risk of developing MacTel. In this image each cluster is identifiable by its colour. Metabolites that appear close together in this network should belong to the same cluster. However, we noticed that some clusters (for example the red or blue clusters) appear to be scattered, indicating that the metabolites composing these clusters tend to have a joint effect in multiple metabolomic pathways.

The size of each dot indicates the relevance of the metabolite to disease risk.

Glycine and serine, identified in our genetic study, appear in this network as the largest points of the blue cluster, confirming the importance of their role in the disease.

Identity by descent analysis of microorganisms

Genomic regions that are inherited from a common ancestor are said to be identical by descent (IBD). Identification of such regions has proven useful in human studies, with applications including discovery of familial relatedness, disease mapping and determining loci under selection. However, little work has been done on IBD analysis of microorganisms that cause disease. This is due in part to the lack of methodologies for non-diploid species, and in part to the occurrence of multiple, genetically distinct clones within an infection.

In this project, we have re-developed an algorithm that was used to infer IBD on the human X chromosome, implemented in XIBD, for IBD analysis of haploid microorganisms that undergo recombination.

We have released this algorithm in the R package isoRelate and have had much success analysing real datasets. In particular, we have performed an IBD analysis of the global Plasmodium falciparum Pf3k dataset and have been able to explore fine-scale population structure as well as identify loci under selection using isoRelate.

Team member: Lyndal Henden 

Reference: Henden L, Lee S, Mueller I, Barry A, Bahlo M. Detecting selection signals in Plasmodium falciparum using identity-by-descent analysis. 2016 BioRxiv doi: 10.1101/088039

Image: Relatedness network for pairs of P. falciparum isolates that are related over the chloroquine resistance transporter gene, Pfcrt

Each node represents a unique P. falciparum isolate, and a line is drawn between two isolates if they were inferred either partially or completely IBD over the gene Pfcrt. Isolates with a single infection - that is, multiplicity of infection (MOI) of 1 - are represented by circles while isolates with multiple infections (MOI > 1) are represented by squares. Here we see that many isolates from both Southeast Asia and Africa are IBD over Pfcrt, which is consistent with literature that suggests a haplotype conferring resistance to the antimalarial drug chloroquine has spread between Southeast Asia and Africa. 

Adaptation of HipSTR to malaria whole-genome sequencing data

Genotyping of short tandem repeats (STRs) has proven difficult in a variety of whole-genome sequencing (WGS) applications due to a number of factors. One of these factors is stutter noise within an STR locus, caused by polymerase chain reaction (PCR) amplification, which results in reads displaying an incorrect number of repeats.

The HipSTR algorithm is designed to deal with these issues and so allow accurate genotyping of STRs from WGS data, in addition to possessing other useful features.

This project will adapt the HipSTR algorithm for malaria WGS data, taking into account the prevalence of multiplicity of infection in malaria samples from Papua New Guinea.

Adaptation of the HipSTR algorithm will involve many adjustments, one being modification of the expectation-maximization algorithm that is used to learn a PCR stutter model for STR loci.

Team member:  Anthony Kocoski  

Reference:  Willems T, Zielinski D, Yuan J, Gordon A, Gymrek M, Erlich Y. Genome-wide profiling of heritable and de novo STR variations. Nature Methods, 14(6):590–592, June 2017 PMID: 28436466

 

Detecting repeat expansions

Short tandem repeats are short repetitive elements of the genome, which can vary in length between individuals. Some repeats are unstable and can expand in length. Repeat expansions cause a number of neurological disorders, such as Huntington's disease and spinocerebellar ataxias. Identifying repeat expansions is difficult as their length can greatly exceed the read lengths of short read sequencing. Standard clinical tests are specialised and expensive, and are not routinely performed for the majority of patients.

Our lab has developed a new method to identify repeat expansions in whole exome and whole genome sequencing data (Tankard et al 2017). We are interested in searching for known or novel repeat expansions associated with a variety of neurological disorders. 

Team members: Mark Bennett, Peter Degorski 

Reference: Tankard RM, Delatycki MB, Lockhart PJ, Bahlo M. Detecting known repeat expansions with standard protocol next generation sequencing, towards developing a single screening test for neurological repeat expansion disorders BioRxiv doi: 10.1101/157792

Image: Detecting repeat expansions  

The statistical method we have developed identifies samples with repeat expansions from short read sequencing data. Samples likely to be affected by the repeat expansion disorder have an increased number of repeated bases and appear shifted to the right, which can be seen for the coloured samples in the figure above. 
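The idea of a sample being "shifted to the right" can be illustrated with a deliberately simple outlier score. The sketch below is not the published method: it assumes reads at a single locus are already available as strings, counts the bases in the longest uninterrupted run of a motif in each read, and compares each sample's mean count with the other samples using a median and MAD based z-score. The sample names, reads and threshold are hypothetical.

```python
import statistics

def longest_repeat_run(read, motif):
    """Number of bases in the longest back-to-back run of `motif` in `read`."""
    best, m, i = 0, len(motif), 0
    while i <= len(read) - m:
        run, j = 0, i
        while read[j:j + m] == motif:
            run += m
            j += m
        if run:
            best = max(best, run)
            i = j
        else:
            i += 1
    return best

def repeat_scores(samples, motif):
    """Robust z-score of each sample's mean repeated-base count."""
    means = {name: statistics.mean(longest_repeat_run(r, motif) for r in reads)
             for name, reads in samples.items()}
    centre = statistics.median(means.values())
    mad = statistics.median(abs(v - centre) for v in means.values())
    scale = (1.4826 * mad) or 1.0  # fall back to 1.0 if all samples are identical
    return {name: (v - centre) / scale for name, v in means.items()}

def make_read(copies, motif="CAG"):
    """Toy read: a repeat tract with short flanking sequence."""
    return "TT" + motif * copies + "AA"

# Toy cohort: four controls and one sample carrying a much longer allele.
samples = {
    "control1": [make_read(6), make_read(7)],
    "control2": [make_read(5), make_read(7)],
    "control3": [make_read(6), make_read(8)],
    "control4": [make_read(7), make_read(7)],
    "case": [make_read(6), make_read(25)],
}

scores = repeat_scores(samples, "CAG")
flagged = [name for name, z in scores.items() if z > 3.5]  # assumed threshold
print("flagged samples:", flagged)
```

A real analysis would also need to handle reads that only partly span the repeat, sequencing errors within the tract, and locus-specific coverage, none of which are modelled here.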

Dating rare mutations

Rare genetic mutations shared by multiple individuals often have their origins in a de novo mutation inherited from a single common ancestor. Recently, we developed a novel method (the Gamma method) for estimating the age of a rare mutation based on ancestral haplotype sharing between mutation carriers. This method is based on the assumption that carriers of older mutations will share less DNA around the common mutation than carriers of younger mutations.

To date, sharing of ancestral haplotypes has been determined manually, which is time consuming and subject to human interpretation and error. We are currently developing a heuristic method to computationally determine the lengths of shared ancestral haplotypes, with a view to providing a more robust and consistent approach to the dating of rare mutations based on ancestral sharing.
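The relationship between shared length and mutation age can be made concrete with a simplified calculation. The sketch below is an illustration under stated assumptions and may differ in detail from the published method: it assumes each carrier descends independently from the ancestor, so that the ancestral segment retained on each side of the mutation has a genetic length (in Morgans) that is exponentially distributed with rate equal to the age in generations. The sum of the 2n arm lengths from n carriers is then Gamma distributed, and (2n - 1) divided by that sum is an unbiased estimate of the age. The function name and the example lengths are hypothetical.

```python
def estimate_age_generations(arm_lengths_cm):
    """Estimate mutation age from (left, right) shared lengths in centimorgans.

    Assumes independent lineages, so each arm length in Morgans is
    Exponential(age) and the sum of all arms is Gamma(number of arms, age).
    """
    # Convert every arm from centimorgans to Morgans.
    arms = [length / 100.0 for pair in arm_lengths_cm for length in pair]
    total = sum(arms)
    if len(arms) < 2 or total <= 0:
        raise ValueError("need at least one carrier with positive shared length")
    # (k - 1) / total is unbiased for the rate of a Gamma(k, rate) sum.
    return (len(arms) - 1) / total

# Hypothetical shared lengths (left arm, right arm) for three carriers, in cM.
carriers = [(1.2, 0.8), (0.5, 1.5), (2.0, 1.0)]
age = estimate_age_generations(carriers)
print(f"estimated age: {age:.1f} generations")
```

Shorter shared segments give a larger estimate, in line with the assumption that older mutations are surrounded by less shared ancestral DNA.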

Team member: Haloom Rafehi 

Reference: Gandolfo LC, Bahlo M, Speed TP. Dating rare mutations from small samples with dense marker data. Genetics. 2014 Aug;197(4):1315-27. Epub 2014 May 30. PMID: 24879464
