Workflow Example
Workflow Example
MaxQuant
For the analysis we require the proteinGroups.txt generated by MaxQuant and a corresponding metadata file where the sample names match with the sample names in the proteinGroups.txt file. At first take a look at the files:
[2]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="openpyxl") # remove warning from mac
The proteinGroups.txt contains all standard column headers from MaxQuant. Later, for our analysis we will use the Protein Intensity described in "LFQ intensity [sample]".
[3]:
protein_groups = pd.read_csv("../testfiles/maxquant/proteinGroups.txt", sep = "\t", low_memory=False)
protein_groups.head(5)
[3]:
| Protein IDs | Majority protein IDs | Peptide counts (all) | Peptide counts (razor+unique) | Peptide counts (unique) | Protein names | Gene names | Fasta headers | Number of proteins | Peptides | ... | Potential contaminant | id | Peptide IDs | Peptide is razor | Mod. peptide IDs | Evidence IDs | MS/MS IDs | Best MS/MS | Oxidation (M) site IDs | Oxidation (M) site positions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P01911;Q29830;Q9MXZ4;Q3LTJ8;Q3LTJ4;Q3LRY0;Q8HW... | P01911;Q29830;Q9MXZ4;Q3LTJ8;Q3LTJ4;Q3LRY0;Q8HW... | 4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;... | 4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;4;... | 0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;... | HLA class II histocompatibility antigen, DRB1-... | HLA-DRB1;HLA-DR15;HLA-DRB1*;HLA-DRB1*1327;MHC ... | ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;... | 1834.0 | 4.0 | ... | NaN | 0.0 | 1287;1288;3971;15222 | True;True;True;True | 1415;1416;4387;16993 | 117340;117341;117342;117343;117344;117345;1173... | 56384;56385;56386;159948;159949;159950;602426;... | 56384;56385;159949;602427 | NaN | NaN |
| 1 | P05121;A0A024QYT5;B7ZAB0;B7Z4X6;B7Z1D9 | P05121;A0A024QYT5;B7ZAB0;B7Z4X6;B7Z1D9 | 10;10;9;8;5 | 10;10;9;8;5 | 10;10;9;8;5 | Plasminogen activator inhibitor 1 | SERPINE1 | ;;;; | 5.0 | 10.0 | ... | NaN | 1.0 | 1592;3771;4396;4628;6470;7404;9193;11222;13191... | True;True;True;True;True;True;True;True;True;True | 1746;4166;4871;5128;7155;8188;10148;10149;1257... | 136637;136638;136639;136640;136641;318672;3186... | 64917;149749;149750;178775;178776;184126;25946... | 64917;149749;178775;184126;259465;290355;36587... | NaN | NaN |
| 2 | P55083;A0A024QZ34;K7ES70 | P55083;A0A024QZ34;K7ES70 | 2;2;2 | 2;2;2 | 2;2;2 | Microfibril-associated glycoprotein 4 | MFAP4 | ;; | 3.0 | 2.0 | ... | NaN | 2.0 | 140;15067 | True;True | 155;156;16823 | 13174;13175;13176;13177;13178;13179;13180;1318... | 7599;7600;7601;7602;7603;7604;7605;7606;598026... | 7604;598026 | 0 | 117 |
| 3 | P09972;A0A024QZ64;A8MVZ9;B7Z3K9;B7Z1N6;B7Z3K7;... | P09972;A0A024QZ64;A8MVZ9;B7Z3K9;B7Z1N6;B7Z3K7;... | 16;16;15;15;13;12;10;10;9;8;7;7;7;5 | 13;13;13;12;10;10;9;9;8;5;7;7;6;4 | 2;2;2;2;0;2;2;2;2;0;2;2;0;0 | Fructose-bisphosphate aldolase C;Fructose-bisp... | ALDOC | ;;;;;;;;; | 14.0 | 16.0 | ... | NaN | 3.0 | 312;749;1545;1675;2085;3265;5046;5708;8847;884... | True;True;True;False;True;True;True;True;True;... | 346;827;1698;1838;2304;3606;5586;5587;6314;976... | 29006;29007;29008;69091;69092;69093;69094;6909... | 15520;15521;34013;34014;34015;34016;34017;6325... | 15520;34015;63252;67942;84108;131214;198759;21... | 1;2 | 40;251 |
| 4 | Q96C19;H0Y4Y4;A0A024QZ77 | Q96C19;H0Y4Y4;A0A024QZ77 | 1;1;1 | 1;1;1 | 1;1;1 | EF-hand domain-containing protein D2 | EFHD2 | ;; | 3.0 | 1.0 | ... | NaN | 4.0 | 8772 | True | 9685 | 734937;734938;734939;734940;734941;734942;734943 | 342134;342135 | 342134 | NaN | NaN |
5 rows × 2530 columns
Further, we created a excel-file with the corresponding metadata to our proteinGroups.txt-file. The sample names in the column “sample” match the names in proteinGroups.txt file.
[4]:
metadata = pd.read_excel("../testfiles/maxquant/metadata.xlsx")
metadata.head(5)
[4]:
| subject external_id | biological_sample external_id | sample | tissue id | disease id | intervention id | tissue | disease | biological_sample quantity | biological_sample quantity_units | ... | Alanine aminotransferase measurement (34608000) | Aspartate aminotransferase measurement (45896001) | Alkaline phosphatase measurement (88810008) | Gamma glutamyl transferase measurement (69480007) | Hemoglobin A1c measurement (43396009) | Total cholesterol:HDL ratio measurement (166842003) | High density lipoprotein measurement (17888004) | Low density lipoprotein cholesterol measurement (113079009) | VLDL cholesterol measurement (104585005) | Triglycerides measurement (14740000) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | 31 | 1_31_C6 | BTO:0000131 | NaN | NaN | blood plasma | healthy | NaN | NaN | ... | 24.0 | 30 | 54 | 21.0 | 6.3 | 3.6 | 1.26 | 2.1 | 0.3 | 0.58 |
| 1 | 32 | 32 | 1_32_C7 | BTO:0000131 | NaN | NaN | blood plasma | healthy | NaN | NaN | ... | 27.0 | 28 | 27 | 38.0 | 5.8 | 6.6 | 1.70 | 4.3 | 0.6 | 1.24 |
| 2 | 33 | 33 | 1_33_C8 | BTO:0000131 | NaN | NaN | blood plasma | healthy | NaN | NaN | ... | 18.0 | 21 | 69 | 18.0 | 6.2 | 5.7 | 1.12 | 4.1 | 0.5 | 1.12 |
| 3 | 34 | 34 | 1_34_C9 | BTO:0000131 | NaN | NaN | blood plasma | healthy | NaN | NaN | ... | 22.0 | 26 | 101 | 20.0 | 6.2 | 6.7 | 0.91 | 4.8 | 1.0 | 2.20 |
| 4 | 35 | 35 | 1_35_C10 | BTO:0000131 | NaN | NaN | blood plasma | healthy | NaN | NaN | ... | 18.0 | 25 | 61 | 13.0 | 5.4 | 5.5 | 1.21 | 3.9 | 0.4 | 0.90 |
5 rows × 46 columns
0. Import AlphaStats
[1]:
import alphastats
1. Import Data
[3]:
maxquant_data = alphastats.MaxQuantLoader(file="../testfiles/maxquant/proteinGroups.txt",
intensity_column="LFQ intensity [sample]",
index_column="Protein IDs")
2. Create a DataSet
Combine the imported MaxQuant data with the metadata
[28]:
ds = alphastats.DataSet(
loader = maxquant_data,
metadata_path_or_df = "../testfiles/maxquant/metadata.xlsx",
sample_column = "sample" # specify the column that corresponds to the sample names in proteinGroups
)
DataSet has been created.
Attributes of the DataSet can be accessed using:
DataSet.rawinput: Raw Protein data.
DataSet.mat: Processed data matrix with ProteinIDs/ProteinGroups as columns and samples as rows. All computations are performed on this matrix.
DataSet.metadata: Metadata for the samples in the matrix. Metadata will be matched with DataSet.mat when needed (for instance Volcano Plot).
AlphaStats will create a matrix of the Protein Intensities, which will be accessable using ds.mat and will save the metadta as a dataframe ds.metadata. Our original MaxQuant ProteinGroup file contains much more samples, than we have metadata for
3. Preprocess
[29]:
print(f"Number of samples in the matrix: {ds.mat.shape[0]}, number of samples in metadata: {ds.metadata.shape[0]}.")
Number of samples in the matrix: 312, number of samples in metadata: 48.
Firstly, we will subset the matrix it will only contains samples, that are also described in the metadata.
[30]:
ds.preprocess(subset=True)
[31]:
print(f"Number of samples in the matrix: {ds.mat.shape[0]}, number of samples in metadata: {ds.metadata.shape[0]}.")
Number of samples in the matrix: 48, number of samples in metadata: 48.
Unnormalized data, Sample Distribution
[32]:
plot = ds.plot_sampledistribution(color = "disease")
plot.show(renderer = "svg")
Contaminations get removed indicated in following columns Only identified by site, Reverse, Potential contaminant (MaxQuant specific) and contamination_library (added by AlphaStats)
Normalized using quantile normalization
Missing Values get imputed using K-nearest neighbour imputation
[33]:
ds.preprocess(
remove_contaminations=True,
normalization = "quantile",
imputation = "knn"
)
/Users/drq441/opt/anaconda3/lib/python3.9/site-packages/sklearn/preprocessing/_data.py:2590: UserWarning:
n_quantiles (1000) is greater than the total number of samples (48). n_quantiles is set to n_samples.
The preprocessing steps can be accessed using:
[34]:
ds.preprocess_print_info()
0 1
0 Raw data number of Protein Groups 2611
1 Matrix: Number of samples 48
2 Matrix: Number of ProteinIDs/ProteinGroups 2409
3 Contaminations have been removed True
4 Normalization quantile
5 Imputation k-Nearest Neighbor
6 Number of removed ProteinGroups due to contami... 202
4. Visualization
Principal Component Analysis (PCA)
[35]:
plot = ds.plot_pca(group = "disease", circle = True)
plot.show(renderer = "svg")