Workflow Example

Workflow Example — AlphaStats 0.0.5 documentation

Workflow Example

MaxQuant

For the analysis we require the proteinGroups.txt generated by MaxQuant and a corresponding metadata file where the sample names match with the sample names in the proteinGroups.txt file. At first take a look at the files:

[2]:

import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="openpyxl") # remove warning from mac

The proteinGroups.txt contains all standard column headers from MaxQuant. Later, for our analysis we will use the Protein Intensity described in "LFQ intensity [sample]".

[3]:

protein_groups = pd.read_csv("../testfiles/maxquant/proteinGroups.txt", sep = "\t", low_memory=False)
protein_groups.head(5)

[3]:

	Protein IDs	Majority protein IDs	Peptide counts (all)	Peptide counts (razor+unique)	Peptide counts (unique)	Protein names	Gene names	Fasta headers	Number of proteins	Peptides	...	Potential contaminant	id	Peptide IDs	Peptide is razor	Mod. peptide IDs	Evidence IDs	MS/MS IDs	Best MS/MS	Oxidation (M) site IDs	Oxidation (M) site positions
0	P01911;Q29830;Q9MXZ4;Q3LTJ8;Q3LTJ4;Q3LRY0;Q8HW...	P01911;Q29830;Q9MXZ4;Q3LTJ8;Q3LTJ4;Q3LRY0;Q8HW...	4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;...	4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;...	0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;...	HLA class II histocompatibility antigen, DRB1-...	HLA-DRB1;HLA-DR15;HLA-DRB1;HLA-DRB11327;MHC ...	;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;...	1834.0	4.0	...	NaN	0.0	1287;1288;3971;15222	True;True;True;True	1415;1416;4387;16993	117340;117341;117342;117343;117344;117345;1173...	56384;56385;56386;159948;159949;159950;602426;...	56384;56385;159949;602427	NaN	NaN
1	P05121;A0A024QYT5;B7ZAB0;B7Z4X6;B7Z1D9	P05121;A0A024QYT5;B7ZAB0;B7Z4X6;B7Z1D9	10;10;9;8;5	10;10;9;8;5	10;10;9;8;5	Plasminogen activator inhibitor 1	SERPINE1	;;;;	5.0	10.0	...	NaN	1.0	1592;3771;4396;4628;6470;7404;9193;11222;13191...	True;True;True;True;True;True;True;True;True;True	1746;4166;4871;5128;7155;8188;10148;10149;1257...	136637;136638;136639;136640;136641;318672;3186...	64917;149749;149750;178775;178776;184126;25946...	64917;149749;178775;184126;259465;290355;36587...	NaN	NaN
2	P55083;A0A024QZ34;K7ES70	P55083;A0A024QZ34;K7ES70	2;2;2	2;2;2	2;2;2	Microfibril-associated glycoprotein 4	MFAP4	;;	3.0	2.0	...	NaN	2.0	140;15067	True;True	155;156;16823	13174;13175;13176;13177;13178;13179;13180;1318...	7599;7600;7601;7602;7603;7604;7605;7606;598026...	7604;598026	0	117
3	P09972;A0A024QZ64;A8MVZ9;B7Z3K9;B7Z1N6;B7Z3K7;...	P09972;A0A024QZ64;A8MVZ9;B7Z3K9;B7Z1N6;B7Z3K7;...	16;16;15;15;13;12;10;10;9;8;7;7;7;5	13;13;13;12;10;10;9;9;8;5;7;7;6;4	2;2;2;2;0;2;2;2;2;0;2;2;0;0	Fructose-bisphosphate aldolase C;Fructose-bisp...	ALDOC	;;;;;;;;;	14.0	16.0	...	NaN	3.0	312;749;1545;1675;2085;3265;5046;5708;8847;884...	True;True;True;False;True;True;True;True;True;...	346;827;1698;1838;2304;3606;5586;5587;6314;976...	29006;29007;29008;69091;69092;69093;69094;6909...	15520;15521;34013;34014;34015;34016;34017;6325...	15520;34015;63252;67942;84108;131214;198759;21...	1;2	40;251
4	Q96C19;H0Y4Y4;A0A024QZ77	Q96C19;H0Y4Y4;A0A024QZ77	1;1;1	1;1;1	1;1;1	EF-hand domain-containing protein D2	EFHD2	;;	3.0	1.0	...	NaN	4.0	8772	True	9685	734937;734938;734939;734940;734941;734942;734943	342134;342135	342134	NaN	NaN

5 rows × 2530 columns

Further, we created a excel-file with the corresponding metadata to our proteinGroups.txt-file. The sample names in the column “sample” match the names in proteinGroups.txt file.

[4]:

metadata = pd.read_excel("../testfiles/maxquant/metadata.xlsx")
metadata.head(5)

[4]:

	subject external_id	biological_sample external_id	sample	tissue id	disease id	intervention id	tissue	disease	biological_sample quantity	biological_sample quantity_units	...	Alanine aminotransferase measurement (34608000)	Aspartate aminotransferase measurement (45896001)	Alkaline phosphatase measurement (88810008)	Gamma glutamyl transferase measurement (69480007)	Hemoglobin A1c measurement (43396009)	Total cholesterol:HDL ratio measurement (166842003)	High density lipoprotein measurement (17888004)	Low density lipoprotein cholesterol measurement (113079009)	VLDL cholesterol measurement (104585005)	Triglycerides measurement (14740000)
0	31	31	1_31_C6	BTO:0000131	NaN	NaN	blood plasma	healthy	NaN	NaN	...	24.0	30	54	21.0	6.3	3.6	1.26	2.1	0.3	0.58
1	32	32	1_32_C7	BTO:0000131	NaN	NaN	blood plasma	healthy	NaN	NaN	...	27.0	28	27	38.0	5.8	6.6	1.70	4.3	0.6	1.24
2	33	33	1_33_C8	BTO:0000131	NaN	NaN	blood plasma	healthy	NaN	NaN	...	18.0	21	69	18.0	6.2	5.7	1.12	4.1	0.5	1.12
3	34	34	1_34_C9	BTO:0000131	NaN	NaN	blood plasma	healthy	NaN	NaN	...	22.0	26	101	20.0	6.2	6.7	0.91	4.8	1.0	2.20
4	35	35	1_35_C10	BTO:0000131	NaN	NaN	blood plasma	healthy	NaN	NaN	...	18.0	25	61	13.0	5.4	5.5	1.21	3.9	0.4	0.90

5 rows × 46 columns

0. Import AlphaStats

[1]:

import alphastats

1. Import Data

[3]:

maxquant_data = alphastats.MaxQuantLoader(file="../testfiles/maxquant/proteinGroups.txt",
                                          intensity_column="LFQ intensity [sample]",
                                          index_column="Protein IDs")

2. Create a DataSet

Combine the imported MaxQuant data with the metadata

[28]:

ds = alphastats.DataSet(
    loader = maxquant_data,
    metadata_path_or_df = "../testfiles/maxquant/metadata.xlsx",
    sample_column = "sample" # specify the column that corresponds to the sample names in proteinGroups
)

DataSet has been created.
Attributes of the DataSet can be accessed using:
DataSet.rawinput:         Raw Protein data.
DataSet.mat:    Processed data matrix with ProteinIDs/ProteinGroups as columns and samples as rows. All computations are performed on this matrix.
DataSet.metadata:       Metadata for the samples in the matrix. Metadata will be matched with DataSet.mat when needed (for instance Volcano Plot).

AlphaStats will create a matrix of the Protein Intensities, which will be accessable using ds.mat and will save the metadta as a dataframe ds.metadata. Our original MaxQuant ProteinGroup file contains much more samples, than we have metadata for

3. Preprocess

[29]:

print(f"Number of samples in the matrix: {ds.mat.shape[0]}, number of samples in metadata: {ds.metadata.shape[0]}.")

Number of samples in the matrix: 312, number of samples in metadata: 48.

Firstly, we will subset the matrix it will only contains samples, that are also described in the metadata.

[30]:

ds.preprocess(subset=True)

[31]:

print(f"Number of samples in the matrix: {ds.mat.shape[0]}, number of samples in metadata: {ds.metadata.shape[0]}.")

Number of samples in the matrix: 48, number of samples in metadata: 48.

Unnormalized data, Sample Distribution

[32]:

plot = ds.plot_sampledistribution(color = "disease")
plot.show(renderer = "svg")

Contaminations get removed indicated in following columns Only identified by site, Reverse, Potential contaminant (MaxQuant specific) and contamination_library (added by AlphaStats)
Normalized using quantile normalization
Missing Values get imputed using K-nearest neighbour imputation

[33]:

ds.preprocess(
    remove_contaminations=True,
    normalization = "quantile",
    imputation = "knn"
)

/Users/drq441/opt/anaconda3/lib/python3.9/site-packages/sklearn/preprocessing/_data.py:2590: UserWarning:

n_quantiles (1000) is greater than the total number of samples (48). n_quantiles is set to n_samples.

The preprocessing steps can be accessed using:

[34]:

ds.preprocess_print_info()

                                                   0                   1
0                  Raw data number of Protein Groups                2611
1                          Matrix: Number of samples                  48
2         Matrix: Number of ProteinIDs/ProteinGroups                2409
3                   Contaminations have been removed                True
4                                      Normalization            quantile
5                                         Imputation  k-Nearest Neighbor
6  Number of removed ProteinGroups due to contami...                 202

4. Visualization

Principal Component Analysis (PCA)

[35]:

plot = ds.plot_pca(group = "disease", circle = True)
plot.show(renderer = "svg")