Description of contents:
-------------------------------
The subdirectories contain .tar files which expand to contents as described below. The data tar files are g-zipped as well since that reduces the file size substantially. For the convenience of those who do not wish to download and manipulate large files, the data used to create the figures in the main bodyof the paper are assembled in an Excel file (events_distributions.xls) located in the results/ directory. The executable files are Matlab scripts and you will need to install Matlab on your system to run them. Two extensions of the basic Matlab installation are recommended: 1) the Matlab Parallel Computing Toolbox, in order to distribute jobs over multiple cores, if available, and 2) Matlab Coder in order to compile selected routines into C for much faster execution. You should create a working directory into which you download and expand the four tar files to create directories working/data, working/genes/, working/results, working/scripts. The Matlab scripts should be executed from the working/ directory: they will then look for input in the data/ and genes/ directories and will create output files in the results/ directory. The initial contents of the directories are as follows: 

data/
Text files containing unique sequence reads, separated into productive and non-productive repertoires. The data is from Harlan Robins, FHCRC (see citation in paper).

genes/
Curated lists of V,D,J gene sequences, and allelic variants. Includes variants from IMGT as well as novel variants found in the Robins et al data. Also included is a list of the primers used, and the RSS sequences for the genes.

scripts/
All the Matlab scripts necessary to replicate our analysis. The workhorse scripts are the following

*VDJ_base_model.m and V_DJ_cuts_insertions_dinuc_ntbias_model.m define the recombination event distribution class.

*do_VDJ_alignments.m reads the sequence data files and produces 'alignments' files that contain information defining all plausible recombination events that could have generated each sequence. 

*V_DJ_cuts_insertions_dinuc_ntbias_model_assignments.m contains the function that evaluates the generation probabilities of sequences by running over all events that can generate the sequence.

*do_probabilistic_model_fitting.m runs the iterative expectation-maximization algorithm, using the 'alignments' files as input. 

*evaluate_generation_probabilities.m computes the probabilities of generation of a set of sequences, when the 'alignments' files are provided.

*mexify.m contains the commands to generate and compile C code from the Matlab code for three key functions for both the 'alignments' step and the EM algorithm. This uses Matlab's Coder Toolbox and it should be run before doing any "production" runs to generate alignment files or infer models.

results/
results.tar expands to two directories:

results/alignments
Contains sample 'alignments' files (in .mat format) produced by running do_VDJ_alignments on sequence data from a few individuals. The alignment files are large, and we have divided them into "chunks" for ease of manipulation. 

results/models/
Contains 'models' generated by application of the EM algorithm to two data sets. A Matlab file containing the model data is recorded for each iteration of the EM algorithm. 

*naive_noncoding_9_indivs_results.mat and mem_noncoding_9_indivs_results.mat  contain the key final results for the distributions of recombination events, the generation probabilities, etc in Matlab's binary file format.  The variable mean_model contains the mean inferred distribution of recombination events.  The 'counter' variables contain unnormalized distributions for each individual.

*event_distributions.xls is an excel spreadsheet containing the final inferred mean distribution of recombination events.


How to run the scripts:
-----------------------------
0. As mentioned above, if the computing platform supports C code, launch Matlab and run mexify (no arguments) before anything else (this only needs to be done once as long as the scripts are not modified). If the computing platform supports multiple cores and the Parallel Toolbox is installed, Matlab has to be told how many "workers" will be assigned to the computation. On a quad-core laptop, one would simply execute "matlabpool(4)" to assign all four cores. The procedure for assigning multiple processors to a Matlab job on a scientific computing cluster varies considerably from installation to installation (in short, we can't give useful general advice on how to get this code up and running as a multiprocessor job on a generic cluster). 

1. To generate 'alignments' files from the data files using do_VDJ_alignments:

tic; do_VDJ_alignments('naive1_noncoding.data',1,100); toc;

With these arguments, the script generates initial alignments for every 100th sequence, starting with the first sequence. The ability to process only a subset of the sequence data is useful for diagnostic purposes.

2. To run Expectation-Maximization algorithm using the 'alignments' files generated by above command:

a. Start from a 'flat' initial guess and do 30 iterations:

tic; do_probabilistic_model_fitting(1:30,'-alignments_file_base_name','results/alignments/naive1_noncoding','-alignments_file_indices',1:9,'-model_type', 'V_DJ_cuts_insertions_dinuc_ntbias_model','-chunk_size',500, '-start_from_flat_prior', 1); toc

b. Resume from a given iteration number (say 7), loading the result of the previous iteration (6):

tic; do_probabilistic_model_fitting(7:30,'-alignments_file_base_name','results/alignments/naive1_noncoding','-alignments_file_indices',1:9,'-model_type', 'V_DJ_cuts_insertions_dinuc_ntbias_model','-chunk_size',500); toc

3. To evaluate the generation probabilities of a set of sequences (for instance of the functional repertoire) whose alignment files have been produced already:

tic; evaluate_generation_probabilities('-model_file','results/models/naive2_noncoding_V_DJ_cuts_insertions_dinuc_ntbias_model/iteration_30/model.mat','-model_type','V_DJ_cuts_insertions_dinuc_ntbias_model','-alignments_file_base_name','results/alignments/naive2_noncoding'); toc;