The "analysis.R" script reproduces the computations from scratch: it collects the SNPs, performs
feature selection, trains and tests the model, and runs the GWAS analysis and enrichment
(Table S1). It also creates most of the figures of the paper and the supplementary material. If you
are only interested in the final model matrix and do not want to run the script, you can download
it as a tab-separated text file here (rows are observations; columns are the response ('y') and the
features, see the column names):

https://github.molgen.mpg.de/budach/miRNA_eQTL/
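
As a minimal sketch, the downloaded model matrix could be read into R and split into response and
features as shown below. The function name read_model_matrix and the toy file are hypothetical and
only serve as an illustration; the sketch assumes nothing beyond a header line, tab separators and a
column named 'y', as described above.

```r
# Hypothetical helper: read a tab-separated model matrix and separate the
# response column 'y' from the feature columns.
read_model_matrix <- function(path) {
  m <- read.delim(path, header = TRUE, sep = "\t", check.names = FALSE)
  stopifnot("y" %in% colnames(m))
  list(
    y = m[["y"]],
    X = m[, setdiff(colnames(m), "y"), drop = FALSE]
  )
}

# Toy stand-in file (made-up values and feature names) so that the sketch
# runs without the real download; replace 'tmp' with the path of your file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("y\tfeature_a\tfeature_b", "1\t0.5\t3", "0\t1.2\t7"), tmp)

mm <- read_model_matrix(tmp)
colnames(mm$X)  # names of the feature columns
table(mm$y)     # class balance of the response
```

For the real file, data.table::fread (data.table is among the attached packages listed below) may
be faster than read.delim.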

#######################

Other data files contained in this supplementary file that are required to run the script:

- cutting_points.bed:          Drosha 5' and 3' cutting points
- disease_categories.txt:      GWAS traits classifications
- host_mirna_independence.txt: results of the 'MiRNA – Host Gene Independence Analysis'
- hostgenes_mirna.txt:         host gene annotations based on Gencode release 19
- id_mapping.txt:              derived from ftp://mirbase.org/pub/mirbase/20/miRNA.dat.gz

If you want to run the script, you need the following additional files:

- promoter predictions: download File S1 and unpack the file
- 1K Genomes variants:  ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz
- GWAS catalog:         https://www.genome.gov/admin/gwascatalog.txt (we downloaded this on 21.04.2015)
- miRBase files:        ftp://mirbase.org/pub/mirbase/20/genomes/hsa.gff2
                        ftp://mirbase.org/pub/mirbase/20/genomes/hsa.gff3
- ChromHMM:             http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeBroadHmm/wgEncodeBroadHmmGm12878HMM.bed.gz
- DNase:                ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeOpenChromDnase/wgEncodeOpenChromDnaseGm12878Pk.narrowPeak.gz
- TFBS:                 http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeRegTfbsClustered/wgEncodeRegTfbsClusteredWithCellsV3.bed.gz
- eQTL data miRNAs:     http://www.ebi.ac.uk/arrayexpress/files/E-GEUV-3/EUR363.mi.cis.FDR5.all.rs137.txt.gz
                        http://www.ebi.ac.uk/arrayexpress/files/E-GEUV-3/YRI89.mi.cis.FDR5.all.rs137.txt.gz
                        http://www.ebi.ac.uk/arrayexpress/files/E-GEUV-3/GD480.MirnaQuantCount.txt.gz
- eQTL data hosts:      http://www.ebi.ac.uk/arrayexpress/files/E-GEUV-3/EUR373.exon.cis.FDR5.all.rs137.txt.gz
                        http://www.ebi.ac.uk/arrayexpress/files/E-GEUV-3/EUR373.gene.cis.FDR5.all.rs137.txt.gz
                        http://www.ebi.ac.uk/arrayexpress/files/E-GEUV-3/YRI89.exon.cis.FDR5.all.rs137.txt.gz
                        http://www.ebi.ac.uk/arrayexpress/files/E-GEUV-3/YRI89.gene.cis.FDR5.all.rs137.txt.gz
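
The downloads listed above could be scripted roughly as follows. This is only a sketch: the helper
fetch_inputs and the directory name "input_data" are hypothetical, and where analysis.R expects the
files is not stated here, so check the paths used in the script before relying on it.

```r
# Hypothetical helper: download each URL into dest_dir, skipping files that
# are already present, and return the local paths.
fetch_inputs <- function(urls, dest_dir = "input_data") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dest_dir, basename(urls))
  for (i in seq_along(urls)) {
    if (!file.exists(dest[i])) {
      download.file(urls[i], destfile = dest[i], mode = "wb")
    }
  }
  invisible(dest)
}

# Two of the URLs listed above; extend the vector with the remaining ones.
urls <- c(
  "ftp://mirbase.org/pub/mirbase/20/genomes/hsa.gff2",
  "ftp://mirbase.org/pub/mirbase/20/genomes/hsa.gff3"
)
# fetch_inputs(urls)  # uncomment to start the download
```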

The script ran successfully with the configuration below (the zcat command-line tool must also be
available):

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: MarIuX64 2.0 GNU/Linux 2010-2012

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C            LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C        LC_PAPER=C          
 [8] LC_NAME=C            LC_ADDRESS=C         LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gridExtra_2.0.0           Matching_4.8-3.4          MASS_7.3-45               ROCR_1.0-7                gplots_2.17.0            
 [6] corrplot_0.73             ggplot2_1.0.1             VariantAnnotation_1.14.13 Rsamtools_1.20.5          Biostrings_2.36.4        
[11] XVector_0.8.0             GenomicRanges_1.20.8      GenomeInfoDb_1.4.3        IRanges_2.2.9             S4Vectors_0.6.6          
[16] BiocGenerics_0.14.0       data.table_1.9.6         

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.2             futile.logger_1.4.1     plyr_1.8.3              GenomicFeatures_1.20.6  bitops_1.0-6            futile.options_1.0.0   
 [7] tools_3.2.0             zlibbioc_1.14.0         biomaRt_2.24.1          digest_0.6.8            RSQLite_1.0.0           gtable_0.1.2           
[13] BSgenome_1.36.3         DBI_0.3.1               proto_0.3-10            rtracklayer_1.28.10     stringr_1.0.0           caTools_1.17.1         
[19] gtools_3.5.0            Biobase_2.28.0          AnnotationDbi_1.30.1    XML_3.98-1.3            BiocParallel_1.2.22     gdata_2.16.1           
[25] reshape2_1.4.1          lambda.r_1.1.7          magrittr_1.5            scales_0.3.0            GenomicAlignments_1.4.2 colorspace_1.2-6       
[31] labeling_0.3            KernSmooth_2.23-14      stringi_1.0-1           RCurl_1.95-4.7          munsell_0.4.2           chron_2.3-47
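
Before running the script, the environment can be checked with a short sketch like the one below.
The helper check_environment is hypothetical; the package list is copied from the "other attached
packages" section above, under the assumption that these are the packages the script loads.

```r
# Attached packages taken from the sessionInfo() output above.
required_pkgs <- c(
  "gridExtra", "Matching", "MASS", "ROCR", "gplots", "corrplot", "ggplot2",
  "VariantAnnotation", "Rsamtools", "Biostrings", "XVector", "GenomicRanges",
  "GenomeInfoDb", "IRanges", "S4Vectors", "BiocGenerics", "data.table"
)

# Hypothetical helper: report whether zcat is on the PATH and which of the
# given packages are not installed.
check_environment <- function(pkgs = required_pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  list(
    zcat_found   = unname(nzchar(Sys.which("zcat"))),
    missing_pkgs = names(installed)[!installed]
  )
}

env <- check_environment()
if (!env$zcat_found) message("zcat was not found on the PATH")
if (length(env$missing_pkgs) > 0) {
  message("Missing packages: ", paste(env$missing_pkgs, collapse = ", "))
}
```

The check only tests whether the packages are installed, not whether their versions match the ones
listed above.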
