GLM Flex Fast2 – Harvard Aging Brain Study

GLM_Flex_Fast2 is an all around whole brain ANOVA tool, that will produce type III SS results. It can handle your standard between subjects ANOVA, between subjects ANCOVA, and repeated measures ANOVA (really any model with a single random factor). Repeated measures ANCOVA is still off the table (it won’t error but the results are likely invalid).

GLM_Flex_Fast2 is the newest fork in the GLM_Flex project. The main impetus of this for was to increase processing speed, reduce the size of data files (e.g. no need for the oo.mat file), and make it easier to specify both models and input data.

The GLM_Flex_Fast2 fork represents the future direction of the GLM_Flex code, so I would encourage you to look into switching to this new method of running your whole brain models.

Currently there are not as many options for GLM_Flex_Fast2 as there are for GLM_Flex, but I am planning on adding these missing features to GLM_Flex_Fast2.

What is GLM_Flex_Fast2 missing?

At present I have disabled the options that allow for Variance/Covariance correction, but I do plan to add this back in down the road.

If this option is crucial for your needs then you should stick to GLM_Flex2 for the time being.

GLM_Flex_Fast2 will automatically produce output for all canonical contrasts entailed in the specified model. Additional post-hoc comparisons must be be manually specified.

The major focus of this update is to bring the GLM_Flex tools in line with other statistical tools such as R, SAS, and the recently introduced LinearModel/LinearMixedModel functions in MATLAB. The purpose is to provide a relatively simple and robust tool for performing the standard suite of GLM/ANOVA analyses at the whole brain level.

How to use GLM_Flex_Fast2

Here is an example of how to setup and use the new scripts with the example dataset.

The first part of the code below is about setting up the data structure dat. Each field in dat represents a variable, and the entries in the same row across fields correspond to the same observation. The file names for the example dataset (e.g. ss_Sub01_run1_Cond1.nii) contain all of the information that we need to specify the model, so what I do is create the data structure dat, and then loop through each input image and parse the filenames to get the relevant variable information. One big benefit of this is that it gives you a single data structure that is capable of representing your entire dataset and all variables/parameters of interest, and it removes the need to keep track of which input images go with which subjects and conditions. In GLM_Flex2 the design matrix is created independently of the data, and so the input images must be organized so that their order matches the design matrix. Here there is no such limitation since each row of each field in dat is paired we have natural manner to track which input images belong to which conditions for which variables.

NOTE: I have set things up to work with either regular data structures or the new dataset objects in the statistics toolbox that were introduced around 2012a. Starting in 2013b there is a new table object that will likely end up being the preferred data organization tool for the GLM_Flex family of tools. For the present, regular data structures will work, and this should preserve functionality for older versions of matlab and those without access to the statistics tool box. Also, It is important that categorical variables be entered with character strings. That is a variable like Sex should be entered as ‘Male’/’Female’ or ‘M’/’F’ or ‘0’/’1′, and not as 0/1. The scripts automatically treat cell-strings as categorical variables, and numeric vectors as continuous variables, so be sure the data in the dat structure is specified in the form that you would like the variable treated.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Scans = dir_wfp('/path/to/example/data/*.nii');

%%% Create a data structure to hold the variables
clear dat
dat.fn = [];
dat.Group = [];
dat.Condition = [];
dat.Run = [];
dat.SS = [];

%%% In this case all the information we need is contained in the file names, 
%%% so all we have to do is parse the file names, and store the information in the dat structure.
for ii = 1:numel(Scans)
    [a b c] = fileparts(lower(Scans{ii}));
    tmp = regexp(b,'_','split');
    
    dat.fn{ii,1} = Scans{ii};
    dat.SS{ii,1} = tmp{2};
    dat.Run{ii,1} = tmp{3};
    dat.Condition{ii,1} = tmp{4};
end

% First 8 Subjects (64 images) are in Group1 the rest are in Group2
dat.Group = cell(120,1);
dat.Group(1:64) = {'G1'};
dat.Group(65:end) = {'G2'};

%%% Now we set up I (same as before) only with new entries for data and for model.
%%% These two new entries obviate the need for the IN + CreateDesign2 step.
clear I;
I.Scans = dat.fn;
I.Model = 'Group*Run*Condition + random(SS|Run*Condition)';
I.Data = dat;
I.OutputDir = '/path/to/analysis/output/directory';
I.RemoveOutliers = 0;
I.DoOnlyAll = 1;
I.estSmooth = 1;
 
%%% With the latest update to GLM_Flex_Fast2 we can also specify post hoc contrasts
I.PostHocs = {'Group$G1' 'G1'; 
             'Group$G2' 'G2'; 
             'Group$G1|Condition$cond1 # Group$G1|Condition$cond2' 'PostHoc3';
             'Group$G2|Condition$cond1 # Group$G2|Condition$cond2' 'PostHoc4';
             'Group$G1|Run$run1 # Group$G1|Run$run2 # Group$G1|Run$run3 # Group$G1|Run$run4' 'PostHoc5';
             'Group$G1|Condition$cond1|Run$run1 # Group$G1|Condition$cond1|Run$run2 # Group$G1|Condition$cond1|Run$run3 # Group$G1|Condition$cond1|Run$run4' 'PostHoc6';
             'Run$run1 # Run$run2' 'PostHoc7';
             'Run$run1 # Run$run2 # Run$run3 # Run$run4' 'PostHoc8';
             'Group$G1 & Group$G2' 'PostHoc9'};

GLM_Flex_Fast2(I);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Once we have created dat, then we can create the I structure for running the analysis. The options for the I-structure are almost identical to the I-structure for GLM_Flex2 with a couple of important differences. In GLM_Flex2 we must pass in the F-stucture that encodes all of the design information, and we must specify the desired output contrasts using the I.Con-Structure. In GLM_Flex_Fast2 rather than specifying I.F and I.Con, we pass in the data structure dat via I.Data = dat, and then we use a symbolic method for specifying the desired model to run. This symbolic notation should be fairly familiar to anyone who uses SAS, R, or the new LinearModel and LinearMixedModel tools from the statistics package in MATLAB. The variables named in the model specification must match (exactly) the variable names as represents in the fields of dat. Once specified the scripts will automatically generate all canonical contrasts, that is all main effects and interactions that are entailed in the model.

NEW!!! In the latest update I have added the ability to specify and perform post hoc contrasts. Contrast can be specified as either a cell vector of contrasts (column or row), in which case the contrast specification is used as the output file name, or it can be specified as above as an N-contrast by 2 cell matrix where the second column contains the desired output name of the contrast. Contrast specification uses a set of special symbols including & # | and $. & can be read as AND in the F-contrast sense of adding the some of squares of constituent parts, this is particularly useful for multiple regression contrasts when we want a map the represents the combined effects of multiple regressors (for instance this is very handy for curvilinear models), but is different from saying something like “is there a difference between the average of C1 and C2 vs. the average of C3 and C4”. This last question requires an alternative model to address. # can be read as vs., so A # B is A vs. B. The | can be through of as meaning “within”, if I want condition 1 within group1 I can specify this as: Group$G1|Condition$Cond1. This brings us to the $. The $ is used to denote a particular level of a categorical variable, note this is not needed when specifying covariates which can be specified simply by the covariate name.

The script works by first breaking the contrast into the additive components by splitting the contrast by the & symbol, this is then followed by breaking each additive component into vs. statements by splitting on #, finally | and $ are used to construct the appropriate conditions that are then recombined into vs. and additive contrasts. This system allows for nearly all contrasts. There are definitely a few that won’t work, but this should cover the majority of the post-hoc contrasts that people will want to run within their models. The scripts also use the contrast specification to determine the appropriate error term. The error term is chosen based on which levels of which variable are crossed in a contrast. For instance is the contrast is between levels of Factor1 within a single level of Factor2, then the Factor1 error term is used. If the contrast crosses levels of both factors, the the error term for the Factor1 by Factor2 interaction is used. This is all done behind the scenes, but the information can be recovered by looking at the I.MOD.RFMs structure. If you have questions about this, please post them to the google group.

My hope is that the reduced burden on properly organizing the input files, and the fully automated contrast specification will make this easier to use for the majority of people and results in fewer mistakes/errors from mis-specification. Additionally this makes data exploration much easier since it becomes quite simple to adjust the model specification and then rerun the model without having to adjust or respecify each contrast of interest.

Notes on model specification

Model specification is now as simple as specifying the model, e.g. ‘GroupRunCondition + random(SS|RunCondition)’. The terms used in the model must match the fields in the data structure. Additionally categorical variable must be specified as cell arrays with conditions entered as strings. This is necessary for proper model creation. For the above model, if you don’t want the three-way interaction in the model, then you will need to specify the model as ‘GroupRun + GroupCondition + RunCondition + random(SS|Run*Condition)’. That is you must specify the terms that you want included, with the shortcut that for any interaction specified the main effects will be automatically included. You can also specify effects with “:” that is Group:Run, which will only include the interaction, and not any of the constituent effects, namely main effects of Group and Run. Finally with the new model specification, there is no need to specify scans in a particular order. As long as all the entires in dat line up correctly, you are good to go.

Random effects – Any random effect (most typically this will be Subjects), can be included in the model, however since we are using the GLM here, only a single random effect can be specified in any model. Furthermore at present it is only possible to specify the nesting of the random effect in terms of factorial models, that is there is not a specification schema in place for noting hierarchical embedding. Once you get to that level of specification or if there are multiple random factors, it’s time to consider moving to a Mixed Effects Model rather than a GLM.

Random effects are noted by ‘random()’ notation, and the specific terms are specified as “Random Factor | within subject terms”. The pipe | can be thought as referring to “nested within”. If there is only a single repeated measure, the specification would still be random(Subj|W1) as opposed to the more common notation of 1|Subj. That is the within subject factor should always be specified after the |. Multiple within factors should be specified as random(Subj|W1W2W2W3W4…) . Additional functionality will be built out in the future, but for now I am sticking with the more typical factorial designs.

Happy data analyzing …

Notes on model creation

Models created under this new schema no longer use the long form of model specification. to illustrate the difference consider a 2×2 between subjects design. In the past, I have used the long form which results in a model with eight columns. 2 for factor 1, 2 for factor 2, and 4 for the interaction. The long form is an over-specified model that was primarily designed for transparency and custom contrast specification. The new GLM_FLex_Fast2 scripts (along with ANOVA_APS) use what I will call Effect Dummy Coding. For the 2×2 design, the first factor is represented as single column where lev1 = 1 and lev2 = -1. The same is true for the second factor, and the interaction is simply the pairwise multiplication of factor 1 and 2. This results in 3 columns for the model, but we also need a constant column which will bring the total model specification to 4 columns. This makes contrast specification as well as the sub vs super model specification very straightforward and fast, but it does make post-hoc contrast specification a bit more difficult. Finally, returning to the issue of repeated measures ANOCVA, the new scripts will let you run them, but I have not validated the results or tweaked the model specification, so it is likely that such models will return invalid or misleading results and I do not recommend that you use it.

This set of scripts also comes with ANOVA_APS which allow you to quickly run any ANOVA model with a single dependent variable.

That’s all for now,

Enjoy,

-Aaron