DataSet
- class DataSet(loader, metadata_path=None, sample_column=None)
Analysis Object
- ancova(protein_id, covar, between)
Analysis of covariance (ANCOVA) with one or more covariate(s). Wrapper around: https://pingouin-stats.org/generated/pingouin.ancova.html
- Parameters
protein_id (str) – ProteinID/ProteinGroup - dependent variable
covar (str or list) – Name(s) of column(s) in metadata with the covariate.
between (str) – Name of column in data with the between factor.
- Returns
ANCOVA summary:
'Source': Names of the factor considered
'SS': Sums of squares
'DF': Degrees of freedom
'F': F-values
'p-unc': Uncorrected p-values
'np2': Partial eta-squared
- Return type
pandas.DataFrame
- anova(column, protein_ids='all', tukey=True)
One-way Analysis of Variance (ANOVA)
- Parameters
column (str) – A metadata column used to calculate ANOVA
protein_ids (str or list, optional) – ProteinIDs to calculate ANOVA for (the dependent variable): either a single ProteinID as a string, several ProteinIDs as a list, or “all” to calculate ANOVA for all ProteinIDs. Defaults to “all”.
tukey (bool, optional) – Whether to calculate a Tukey-HSD post-hoc test. Defaults to True.
- Returns
'Protein ID': ProteinID/ProteinGroup
'ANOVA_pvalue': p-value of ANOVA
'A vs. B Tukey test': Tukey-HSD corrected p-values (each combination represents a column)
- Return type
pandas.DataFrame
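To make the statistic behind the 'ANOVA_pvalue' column concrete, the following is a minimal sketch of the one-way ANOVA F-value in plain Python. It is not the library's implementation; the function name `one_way_anova_f` and the example intensities are illustrative assumptions, and the p-value (which needs the F distribution) is deliberately left out.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = fmean(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical intensities of one protein in three metadata groups.
intensities_by_group = {
    "control": [1.0, 2.0, 3.0],
    "treated": [2.0, 3.0, 4.0],
    "rescue": [5.0, 6.0, 7.0],
}
f_value, df_between, df_within = one_way_anova_f(list(intensities_by_group.values()))
```

A large F-value means the group means differ by more than the within-group scatter would suggest; the Tukey-HSD columns then indicate which pairs of groups drive that difference.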
- calculate_tukey(protein_id, group, df=None)
Calculate a pairwise Tukey-HSD post-hoc test. Wrapper around: https://pingouin-stats.org/generated/pingouin.pairwise_tukey.html#pingouin.pairwise_tukey
- Parameters
protein_id (str) – ProteinID to calculate the pairwise Tukey-HSD post-hoc test for (the dependent variable)
group (str) – A metadata column used to calculate the pairwise Tukey test
df (pandas.DataFrame, optional) – Defaults to None.
- Returns
'A': Name of first measurement
'B': Name of second measurement
'mean(A)': Mean of first measurement
'mean(B)': Mean of second measurement
'diff': Mean difference (= mean(A) - mean(B))
'se': Standard error
'T': T-values
'p-tukey': Tukey-HSD corrected p-values
'hedges': Hedges effect size (or any effect size defined in effsize)
'comparison': Combination of measurements
'Protein ID': ProteinID/ProteinGroup
- Return type
pandas.DataFrame
- create_matrix()
Creates a matrix from the output file, with samples as rows and features (proteins) as columns.
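The orientation described above (one row per sample, one column per protein) can be illustrated with a small transpose in plain Python. This is only a sketch of the layout, not the method's actual code; `to_sample_matrix` and the protein and sample names are hypothetical.

```python
def to_sample_matrix(protein_rows, sample_names):
    """Transpose {protein_id: [intensity per sample]} to {sample: {protein_id: intensity}}."""
    matrix = {sample: {} for sample in sample_names}
    for protein_id, intensities in protein_rows.items():
        if len(intensities) != len(sample_names):
            raise ValueError(f"{protein_id}: expected {len(sample_names)} values")
        for sample, value in zip(sample_names, intensities):
            matrix[sample][protein_id] = value
    return matrix


# Hypothetical output-file layout: one row per protein, one value per sample.
protein_rows = {"P1": [10.0, 12.0], "P2": [7.0, 9.0]}
matrix = to_sample_matrix(protein_rows, sample_names=["S1", "S2"])
```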
- load_metadata(file_path, sample_column)
Load metadata from an xlsx, csv or txt file
- Parameters
file_path (str) – path to metadata file
sample_column (str) – column name with sample IDs
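The role of sample_column is easiest to see on a small table: it names the column whose values identify the samples, so every other column can be looked up per sample. The sketch below uses only the standard csv module and an in-memory file; `read_metadata` and the column names are assumptions for illustration, not part of the documented API.

```python
import csv
import io


def read_metadata(handle, sample_column, delimiter=","):
    """Read a delimited metadata table and index its rows by sample ID."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    if sample_column not in (reader.fieldnames or []):
        raise KeyError(f"column {sample_column!r} not found in metadata")
    metadata = {}
    for row in reader:
        sample_id = row.pop(sample_column)  # remaining keys are the annotations
        metadata[sample_id] = dict(row)
    return metadata


# Hypothetical metadata file content.
example = io.StringIO("sample,disease,age\nS1,healthy,34\nS2,diabetic,51\n")
metadata = read_metadata(example, sample_column="sample")
```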
- perform_diff_expression_analysis(group1, group2, column=None, method='ttest')
Perform differential expression analysis using a t-test or a Wald test. The Wald test fits a generalized linear model.
- Parameters
column (str) – column name in the metadata file with the two groups to compare
group1 (str/list) – name of the group to compare (must be present in column), or a list of sample names to compare
group2 (str/list) – name of the group to compare (must be present in column), or a list of sample names to compare
method (str, optional) – statistical method to calculate differential expression: “ttest” for a t-test or “wald” for a Wald test. Defaults to “ttest”.
- Returns
pandas DataFrame with foldchange, foldchange_log2 and pvalue for each ProteinID/ProteinGroup between group1 and group2:
'Protein ID': ProteinID/ProteinGroup
'pval': p-value of the ProteinID/ProteinGroup
'qval': multiple-testing-corrected p-value
'log2fc': log2(foldchange)
'grad': the gradient of the log-likelihood
'coef_mle': the maximum-likelihood estimate of coefficient in liker-space
'coef_sd': the standard deviation of the coefficient in liker-space
'll': the log-likelihood of the estimation
- Return type
pandas.DataFrame
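As a rough illustration of what the 'log2fc' column and a t-test compare, here is a minimal sketch for a single protein in plain Python. It assumes the fold change is the ratio of group means (group1 over group2) and uses Student's pooled-variance t-statistic; both are assumptions, since this page does not state how the library computes them, and the helper names are hypothetical.

```python
import math
from statistics import fmean, variance


def log2_fold_change(group1, group2):
    """log2 of the ratio of mean intensities (assumed direction: group1 / group2)."""
    return math.log2(fmean(group1) / fmean(group2))


def student_t_statistic(group1, group2):
    """Two-sample t-statistic with pooled variance (equal variances assumed)."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (fmean(group1) - fmean(group2)) / standard_error


# Hypothetical raw intensities of one protein in two groups.
treated = [8.0, 10.0, 12.0, 10.0]
control = [4.0, 5.0, 6.0, 5.0]

log2fc = log2_fold_change(treated, control)
t_stat = student_t_statistic(treated, control)
```

Turning the t-statistic into a p-value requires the t distribution with n1 + n2 - 2 degrees of freedom, which is omitted here to keep the sketch dependency-free.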
- plot_clustermap(**kwargs)
- plot_correlation_matrix(method='pearson')
Plot Correlation Matrix
- Parameters
method (str, optional) – correlation coefficient: “pearson”, “kendall” (Kendall Tau correlation) or “spearman”. Defaults to “pearson”.
- Returns
Correlation matrix
- Return type
plotly.graph_objects._figure.Figure
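For readers unfamiliar with what such a matrix contains, the sketch below computes pairwise Pearson coefficients in plain Python. It assumes the correlation is taken between samples across their protein intensities, which this page does not state explicitly; `pearson`, `correlation_matrix` and the sample data are hypothetical.

```python
import math
from statistics import fmean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = fmean(x), fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(samples):
    """Return {sample_a: {sample_b: r}} for every pair of samples."""
    names = list(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names} for a in names}


# Hypothetical protein intensities per sample.
samples = {
    "S1": [1.0, 2.0, 3.0, 4.0],
    "S2": [2.0, 4.0, 6.0, 8.0],
    "S3": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(samples)
```

The matrix is symmetric with ones on the diagonal, which is why such plots are usually read one triangle at a time.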
- plot_dendrogram(**kwargs)
- plot_intensity(protein_id, group=None, subgroups=None, method='box', add_significance=False, log_scale=False)
Plot Intensity of individual Protein/ProteinGroup
- Parameters
protein_id (str) – ProteinGroup ID
group (str, optional) – A metadata column used for grouping. Defaults to None.
subgroups (list, optional) – Select variables from the group column. Defaults to None.
method (str, optional) – Violinplot = “violin”, Boxplot = “box”, Scatterplot = “scatter”. Defaults to “box”.
add_significance (bool, optional) – add a p-value bar; only possible when two groups are compared. Defaults to False.
log_scale (bool, optional) – plot the y-axis on a logarithmic scale. Defaults to False.
- Returns
Plotly Plot
- Return type
plotly.graph_objects._figure.Figure
- plot_pca(group=None, circle=False)
Plot Principal Component Analysis (PCA)
- Parameters
group (str, optional) – column in metadata that should be used for coloring. Defaults to None.
circle (bool, optional) – draw circle around each group. Defaults to False.
- Returns
PCA plot
- Return type
plotly.graph_objects._figure.Figure
- plot_sampledistribution(method='violin', color=None, log_scale=False)
Plot the intensity distribution for each sample, as either a violin plot or a box plot
- Parameters
method (str, optional) – Violinplot = “violin”, Boxplot = “box”. Defaults to “violin”.
color (str, optional) – A metadata column used to color the boxes. Defaults to None.
log_scale (bool, optional) – plot the y-axis on a logarithmic scale. Defaults to False.
- Returns
Plotly Sample Distribution Plot
- Return type
plotly.graph_objects._figure.Figure
- plot_tsne(group=None, circle=False, perplexity=30, n_iter=1000)
Plot t-distributed stochastic neighbor embedding (t-SNE)
- Parameters
group (str, optional) – column in metadata that should be used for coloring. Defaults to None.
circle (bool, optional) – draw circle around each group. Defaults to False.
perplexity (int, optional) – t-SNE perplexity. Defaults to 30.
n_iter (int, optional) – number of iterations. Defaults to 1000.
- Returns
t-SNE plot
- Return type
plotly.graph_objects._figure.Figure
- plot_umap(group=None, circle=False)
Plot Uniform Manifold Approximation and Projection for Dimension Reduction
- Parameters
group (str, optional) – column in metadata that should be used for coloring. Defaults to None.
circle (bool, optional) – draw circle around each group. Defaults to False.
- Returns
UMAP plot
- Return type
plotly.graph_objects._figure.Figure
- plot_volcano(group1, group2, column=None, method='ttest', labels=False, min_fc=1, alpha=0.05, draw_line=True)
Plot Volcano Plot
- Parameters
column (str) – column name in the metadata file with the two groups to compare
group1 (str/list) – name of the group to compare (must be present in column), or a list of sample names to compare
group2 (str/list) – name of the group to compare (must be present in column), or a list of sample names to compare
method (str) – “anova”, “wald” or “ttest”. Defaults to “ttest”.
labels (bool) – add text labels to significant proteins. Defaults to False.
alpha (float, optional) – p-value cut-off. Defaults to 0.05.
min_fc (float) – minimum fold change. Defaults to 1.
draw_line (bool) – whether to draw cut-off lines. Defaults to True.
- Returns
Volcano Plot
- Return type
plotly.graph_objects._figure.Figure
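The interplay of alpha and min_fc can be sketched as a simple classification rule. The version below assumes that min_fc is compared against the absolute log2 fold change and that a point is significant when its p-value is strictly below alpha; neither detail is confirmed by this page, and the function names are hypothetical.

```python
import math


def classify_volcano_point(log2fc, pvalue, min_fc=1.0, alpha=0.05):
    """Label a protein as 'up', 'down' or 'non_significant'."""
    if pvalue >= alpha or abs(log2fc) < min_fc:
        return "non_significant"
    return "up" if log2fc > 0 else "down"


def volcano_coordinates(log2fc, pvalue):
    """x is the log2 fold change, y is -log10(p), the usual volcano axes."""
    return log2fc, -math.log10(pvalue)


# Hypothetical results: (log2fc, p-value) per protein.
results = {"P1": (2.0, 0.001), "P2": (-1.5, 0.01), "P3": (0.5, 0.001), "P4": (3.0, 0.2)}
labels = {pid: classify_volcano_point(fc, p) for pid, (fc, p) in results.items()}
```

Under these assumptions the cut-off lines would sit at x = -min_fc, x = +min_fc and y = -log10(alpha).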
- preprocess(remove_contaminations=False, subset=False, normalization=None, imputation=None, remove_samples=None)
Preprocess Protein data
Removal of contaminations:
Removes all observations that were identified as contaminations.
Normalization:
“zscore”, “quantile”, “linear”, “vst”
Normalize data using z-score, quantile or linear (using the l2 norm) normalization, or a variance stabilization transformation (“vst”).
Z-score normalization equals standardization using StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
Variance stabilization transformation uses PowerTransformer: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html
For more information, see scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html
Imputation:
“mean”, “median”, “knn” or “randomforest”. For more information visit:
SimpleImputer: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
k-Nearest Neighbors Imputation: https://scikit-learn.org/stable/modules/impute.html#impute
Random Forest Imputation: https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor
- Parameters
remove_contaminations (bool, optional) – remove ProteinGroups that are identified as contamination. Defaults to False.
normalization (str, optional) – method to normalize data: either “zscore”, “quantile”, “linear” or “vst”. Defaults to None.
remove_samples (list, optional) – list with sample ids to remove. Defaults to None.
imputation (str, optional) – method to impute data: either “mean”, “median”, “knn” or “randomforest”. Defaults to None.
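Two of the options above, “mean” imputation and “zscore” normalization, are simple enough to sketch in plain Python for a single vector of intensities. This is an illustration of the techniques, not the library's code: the real method delegates to scikit-learn, and the axis along which it operates and the order of the steps are not specified on this page. The helper names are hypothetical.

```python
import math
from statistics import fmean, pstdev


def _is_missing(value):
    return value is None or math.isnan(value)


def impute_mean(values):
    """Replace missing entries (None or NaN) with the mean of the observed ones."""
    observed = [v for v in values if not _is_missing(v)]
    if not observed:
        raise ValueError("cannot impute a vector without observed values")
    fill = fmean(observed)
    return [fill if _is_missing(v) else v for v in values]


def zscore(values):
    """Center to mean 0 and scale to unit population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population std, as StandardScaler uses
    if std == 0:
        return [0.0 for _ in values]  # constant input: only centering applies
    return [(v - mean) / std for v in values]


# Hypothetical intensities with one missing value.
raw = [1.0, float("nan"), 3.0]
imputed = impute_mean(raw)
normalized = zscore(imputed)
```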
- preprocess_print_info()
Print summary of preprocessing steps
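The methods documented on this page can be chained into a short workflow. The sketch below only uses method names, parameters and option values listed above; `run_basic_workflow` itself is a hypothetical helper, `dataset` is assumed to be an already constructed DataSet (building the loader is outside the scope of this page), and the chosen preprocessing options are arbitrary examples.

```python
def run_basic_workflow(dataset, column, group1, group2):
    """Preprocess, test two groups against each other and draw a volcano plot.

    `dataset` is assumed to expose the DataSet methods documented on this page.
    """
    # Example options only; pick the ones that suit the data.
    dataset.preprocess(
        remove_contaminations=True,
        normalization="zscore",
        imputation="mean",
    )
    dataset.preprocess_print_info()

    results = dataset.perform_diff_expression_analysis(
        group1, group2, column=column, method="ttest"
    )
    figure = dataset.plot_volcano(
        group1, group2, column=column, method="ttest", labels=True
    )
    return results, figure
```

Passing the dataset in as an argument keeps the helper independent of how the loader and metadata were set up.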