DataSet

class DataSet(loader, metadata_path=None, sample_column=None)

Analysis Object

ancova(protein_id, covar, between)

Analysis of covariance (ANCOVA) with one or more covariate(s). Wrapper around: https://pingouin-stats.org/generated/pingouin.ancova.html

Parameters

protein_id (str) – ProteinID/ProteinGroup - dependent variable
covar (str or list) – Name(s) of column(s) in metadata with the covariate.
between (str) – Name of column in data with the between factor.

Returns

ANCOVA summary:

'Source': Names of the factor considered
'SS': Sums of squares
'DF': Degrees of freedom
'F': F-values
'p-unc': Uncorrected p-values
'np2': Partial eta-squared

Return type

pandas.DataFrame

anova(column, protein_ids='all', tukey=True)

One-way Analysis of Variance (ANOVA)

Parameters

column (str) – A metadata column used to calculate ANOVA
protein_ids (str or list, optional) – ProteinIDs to calculate ANOVA for (dependent variable): either a single ProteinID as a string, several ProteinIDs as a list, or “all” to calculate ANOVA for all ProteinIDs. Defaults to “all”.
tukey (bool, optional) – Whether to calculate a Tukey-HSD post-hoc test. Defaults to True.

Returns

'Protein ID': ProteinID/ProteinGroup
'ANOVA_pvalue': p-value of ANOVA
'A vs. B Tukey test': Tukey-HSD corrected p-values (each combination represents a column)

Return type

pandas.DataFrame
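As an illustration of the statistic behind 'ANOVA_pvalue', the sketch below computes the one-way ANOVA F-value for a single protein by hand. It is not the implementation used by anova(); the function name and the example intensities are hypothetical, and the p-value (which requires the F distribution) is left out.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: group size times the squared distance
    # of the group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared distance of each value
    # from its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical intensities of one protein, grouped by a metadata column.
intensities = {
    "control": [1.0, 2.0, 3.0],
    "treated": [2.0, 3.0, 4.0],
    "rescue": [3.0, 4.0, 5.0],
}
f_value = one_way_anova_f(list(intensities.values()))
print(f_value)
```

A large F-value means the group means differ more than the spread within the groups would suggest.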

calculate_tukey(protein_id, group, df=None)

Calculate pairwise Tukey-HSD post-hoc test. Wrapper around: https://pingouin-stats.org/generated/pingouin.pairwise_tukey.html#pingouin.pairwise_tukey

Parameters

protein_id (str) – ProteinID to calculate the pairwise Tukey-HSD post-hoc test for (dependent variable)
group (str) – A metadata column used to calculate the pairwise Tukey test
df (pandas.DataFrame, optional) – Defaults to None.

Returns

'A': Name of first measurement
'B': Name of second measurement
'mean(A)': Mean of first measurement
'mean(B)': Mean of second measurement
'diff': Mean difference (= mean(A) - mean(B))
'se': Standard error
'T': T-values
'p-tukey': Tukey-HSD corrected p-values
'hedges': Hedges effect size (or any effect size defined in effsize)
'comparison': combination of measurements
'Protein ID': ProteinID/ProteinGroup

Return type

pandas.DataFrame
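To make the 'A', 'B', 'mean(A)', 'mean(B)' and 'diff' columns concrete, the sketch below builds one row per pair of groups. It only covers the descriptive columns, not the Tukey-HSD correction itself. The function name, the example data and the "A vs. B" format of the 'comparison' value are assumptions.

```python
from itertools import combinations
from statistics import mean


def pairwise_mean_differences(groups):
    """Return one row per unordered pair of groups with their mean difference."""
    rows = []
    for a, b in combinations(groups, 2):
        mean_a, mean_b = mean(groups[a]), mean(groups[b])
        rows.append(
            {
                "A": a,
                "B": b,
                "mean(A)": mean_a,
                "mean(B)": mean_b,
                "diff": mean_a - mean_b,  # mean(A) - mean(B), as documented
                "comparison": f"{a} vs. {b}",  # assumed label format
            }
        )
    return rows


# Hypothetical intensities of one protein, grouped by a metadata column.
groups = {
    "control": [1.0, 2.0, 3.0],
    "treated": [2.0, 3.0, 4.0],
    "rescue": [3.0, 4.0, 5.0],
}
rows = pairwise_mean_differences(groups)
for row in rows:
    print(row["comparison"], row["diff"])
```

Three groups give three pairwise comparisons; in general, k groups give k * (k - 1) / 2 rows.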

create_matrix()

Creates a matrix of the output file, with columns displaying features (proteins) and rows the samples.
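The orientation of the matrix can be sketched as a transposition. This assumes the output file lists one protein per row with one intensity value per sample, which is common for proteomics software but is not stated here. The function name and the data are hypothetical.

```python
def to_sample_matrix(protein_rows):
    """Transpose {protein: {sample: value}} into {sample: {protein: value}}."""
    matrix = {}
    for protein, intensities in protein_rows.items():
        for sample, value in intensities.items():
            # Each sample becomes a row; each protein becomes a column.
            matrix.setdefault(sample, {})[protein] = value
    return matrix


# Hypothetical output file content: proteins as rows, samples as columns.
protein_rows = {
    "P1": {"sample_1": 10.0, "sample_2": 12.0},
    "P2": {"sample_1": 20.0, "sample_2": 18.0},
}
matrix = to_sample_matrix(protein_rows)
print(matrix["sample_1"])
```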

load_metadata(file_path, sample_column)

Load metadata from an xlsx, csv or txt file

Parameters

file_path (str) – path to metadata file
sample_column (str) – column name with sample IDs

overview()

Print overview of the DataSet

perform_diff_expression_analysis(group1, group2, column=None, method='ttest')

Perform differential expression analysis using a t-test or a Wald test. A Wald test will fit a generalized linear model.

Parameters

column (str) – column name in the metadata file with the two groups to compare
group1 (str/list) – name of the group to compare (must be present in column), or a list of sample names to compare
group2 (str/list) – name of the group to compare (must be present in column), or a list of sample names to compare
method (str, optional) – statistical method used to calculate differential expression: “ttest” for a t-test or “wald” for a Wald test. Defaults to “ttest”.

Returns

pandas DataFrame with foldchange, foldchange_log2 and pvalue for each ProteinID/ProteinGroup between group1 and group2.

'Protein ID': ProteinID/ProteinGroup
'pval': p-value of the ProteinID/ProteinGroup
'qval': multiple testing - corrected p-value
'log2fc': log2(foldchange)
'grad': the gradient of the log-likelihood
'coef_mle': the maximum-likelihood estimate of the coefficient in linker-space
'coef_sd': the standard deviation of the coefficient in linker-space
'll': the log-likelihood of the estimation

Return type

pandas.DataFrame
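The two central quantities of the t-test variant, 'log2fc' and the test statistic, can be sketched for a single protein as follows. This is an illustration under assumptions, not the method's implementation: the fold change is taken as mean(group1) / mean(group2) on non-log intensities, the t-test is the equal-variance (Student) form, and the function names and data are hypothetical. The p-value and its multiple-testing correction are omitted.

```python
import math
from statistics import mean, variance


def log2_fold_change(group1, group2):
    """log2 of the ratio of group means (direction group1 / group2 is assumed)."""
    return math.log2(mean(group1) / mean(group2))


def student_t(group1, group2):
    """Two-sample t statistic with pooled (equal) variance."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


# Hypothetical intensities of one protein in the two groups.
group1 = [7.0, 8.0, 9.0]
group2 = [3.0, 4.0, 5.0]
log2fc = log2_fold_change(group1, group2)
t_stat = student_t(group1, group2)
print(log2fc, t_stat)
```

Here the mean of group1 is twice the mean of group2, so the log2 fold change is 1.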

plot_clustermap(**kwargs)

plot_correlation_matrix(method='pearson')

Plot Correlation Matrix

Parameters

method (str, optional) – correlation coefficient: “pearson”, “kendall” (Kendall Tau correlation) or “spearman”. Defaults to “pearson”.

Returns

Correlation matrix

Return type

plotly.graph_objects._figure.Figure
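For the default method, the values shown in such a plot can be sketched with a hand-written Pearson correlation. The sketch assumes correlations are computed between samples, which is not stated above, and all names and data are hypothetical.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(samples):
    """Return {sample: {sample: r}} for every pair of samples."""
    names = list(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names} for a in names}


# Hypothetical protein intensities per sample (same protein order in each list).
samples = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [2.0, 4.0, 6.0, 8.0],
    "s3": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(samples)
print(corr["s1"]["s2"], corr["s1"]["s3"])
```

The matrix is symmetric with ones on the diagonal, so a heatmap of it shows each pair of samples twice.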

plot_dendrogram(**kwargs)

plot_intensity(protein_id, group=None, subgroups=None, method='box', add_significance=False, log_scale=False)

Plot Intensity of individual Protein/ProteinGroup

Parameters

protein_id (str) – ProteinGroup ID
group (str, optional) – A metadata column used for grouping. Defaults to None.
subgroups (list, optional) – Select variables from the group column. Defaults to None.
method (str, optional) – Violinplot = “violin”, Boxplot = “box”, Scatterplot = “scatter”. Defaults to “box”.
add_significance (bool, optional) – add p-value bar, only possible when two groups are compared. Default False.
log_scale (bool, optional) – y-axis in logarithmic scale. Defaults to False.

Returns

Plotly Plot

Return type

plotly.graph_objects._figure.Figure

plot_pca(group=None, circle=False)

Plot Principal Component Analysis (PCA)

Parameters

group (str, optional) – column in metadata that should be used for coloring. Defaults to None.
circle (bool, optional) – draw circle around each group. Defaults to False.

Returns

PCA plot

Return type

plotly.graph_objects._figure.Figure
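What a PCA plot displays can be sketched without any library: center the sample-by-protein matrix, build the covariance matrix and find its dominant direction. The sketch below uses power iteration to obtain only the first principal component. This is a technique chosen for brevity and not necessarily what plot_pca() uses internally; the function name and data are hypothetical.

```python
import math
from statistics import mean


def first_principal_component(rows, iterations=100):
    """Return (direction, scores) of the first principal component.

    rows: list of samples, each a list of feature values.
    """
    n, n_features = len(rows), len(rows[0])
    col_means = [mean(r[j] for r in rows) for j in range(n_features)]
    centered = [[r[j] - col_means[j] for j in range(n_features)] for r in rows]
    # Sample covariance matrix of the features.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(n_features)]
        for i in range(n_features)
    ]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalize. The start vector must not be orthogonal to the result.
    vec = [1.0] * n_features
    for _ in range(iterations):
        new = [sum(cov[i][j] * vec[j] for j in range(n_features)) for i in range(n_features)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    # Project each centered sample onto the component to get its coordinate.
    scores = [sum(c[j] * vec[j] for j in range(n_features)) for c in centered]
    return vec, scores


# Hypothetical data: four samples, two proteins, lying exactly on y = 2x.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(rows)
print(direction, scores)
```

Each sample's score is its coordinate along the component; a PCA plot typically shows the scores of the first two components against each other.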

plot_sampledistribution(method='violin', color=None, log_scale=False)

Plot the intensity distribution of each sample as either a violin plot or a box plot.

Parameters

method (str, optional) – Violinplot = “violin”, Boxplot = “box”. Defaults to “violin”.
color (str, optional) – A metadata column used to color the boxes. Defaults to None.
log_scale (bool, optional) – y-axis in logarithmic scale. Defaults to False.

Returns

Plotly Sample Distribution Plot

Return type

plotly.graph_objects._figure.Figure

plot_tsne(group=None, circle=False, perplexity=30, n_iter=1000)

Plot t-distributed stochastic neighbor embedding (t-SNE)

Parameters

group (str, optional) – column in metadata that should be used for coloring. Defaults to None.
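circle (bool, optional) – draw circle around each group. Defaults to False.
perplexity (int, optional) – perplexity of the t-SNE embedding. Defaults to 30.
n_iter (int, optional) – number of iterations for the optimization. Defaults to 1000.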

Returns

t-SNE plot

Return type

plotly.graph_objects._figure.Figure

plot_umap(group=None, circle=False)

Plot Uniform Manifold Approximation and Projection for Dimension Reduction

Parameters

group (str, optional) – column in metadata that should be used for coloring. Defaults to None.
circle (bool, optional) – draw circle around each group. Defaults to False.

Returns

UMAP plot

Return type

plotly.graph_objects._figure.Figure

plot_volcano(group1, group2, column=None, method='ttest', labels=False, min_fc=1, alpha=0.05, draw_line=True)

Plot Volcano Plot

Parameters

column (str) – column name in the metadata file with the two groups to compare
group1 (str/list) – name of the group to compare (must be present in column), or a list of sample names to compare
group2 (str/list) – name of the group to compare (must be present in column), or a list of sample names to compare
method (str) – “anova”, “wald” or “ttest”. Defaults to “ttest”.
labels (bool) – add text labels to significant proteins. Defaults to False.
alpha (float, optional) – p-value cut-off. Defaults to 0.05.
min_fc (float) – minimum fold change. Defaults to 1.
draw_line (bool) – whether to draw cut-off lines. Defaults to True.

Returns

Volcano Plot

Return type

plotly.graph_objects._figure.Figure
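The role of alpha and min_fc can be sketched as a simple classification of the points in the plot. The sketch assumes that min_fc is compared against the absolute log2 fold change and alpha against the uncorrected p-value; neither is confirmed above, and the function name and data are hypothetical.

```python
import math


def volcano_points(results, min_fc=1, alpha=0.05):
    """Return plot coordinates and a significance flag for each protein."""
    points = []
    for row in results:
        # Assumed rule: significant if the p-value is below alpha and the
        # absolute log2 fold change reaches min_fc.
        significant = row["pval"] < alpha and abs(row["log2fc"]) >= min_fc
        points.append(
            {
                "Protein ID": row["Protein ID"],
                "x": row["log2fc"],
                "y": -math.log10(row["pval"]),  # usual volcano y-axis
                "significant": significant,
            }
        )
    return points


# Hypothetical differential expression results.
results = [
    {"Protein ID": "P1", "log2fc": 2.0, "pval": 0.001},
    {"Protein ID": "P2", "log2fc": 0.5, "pval": 0.001},
    {"Protein ID": "P3", "log2fc": -3.0, "pval": 0.2},
]
points = volcano_points(results)
print([p["Protein ID"] for p in points if p["significant"]])
```

With draw_line enabled, the cut-off lines would correspond to these two thresholds: vertical lines at plus and minus min_fc and a horizontal line at -log10(alpha).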

preprocess(remove_contaminations=False, subset=False, normalization=None, imputation=None, remove_samples=None)

Preprocess Protein data

Removal of contaminations:

Removes all observations that were identified as contaminations.

Normalization:

“zscore”, “quantile”, “linear”, “vst”

Normalize data using either zscore, quantile, linear (using l2 norm) or vst (variance stabilization transformation) normalization.

Z-score normalization equals standardization using StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Variance stabilization transformation uses: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html

For more information on linear normalization, see sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html
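The arithmetic behind the “zscore” and “linear” options can be sketched on a single vector of intensities. Along which axis (per sample or per protein) preprocess() applies them is not stated here, so the sketch leaves that open. The function names and data are hypothetical.

```python
import math
from statistics import mean, pstdev


def zscore(values):
    """Subtract the mean and divide by the population standard deviation.

    StandardScaler also uses the population (ddof=0) standard deviation.
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def l2_normalize(values):
    """Scale the vector to unit Euclidean (l2) length."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]


# Hypothetical intensities.
z = zscore([2.0, 4.0, 6.0])
unit = l2_normalize([3.0, 4.0])
print(z, unit)
```

After z-scoring, the values have mean 0 and standard deviation 1; after l2 normalization, the squared values sum to 1.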

Imputation:

“mean”, “median”, “knn” or “randomforest”. For more information visit:

SimpleImputer: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

k-Nearest Neighbors Imputation: https://scikit-learn.org/stable/modules/impute.html#impute

Random Forest Imputation: https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor
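The two simplest strategies, “mean” and “median”, can be sketched for one column of values, with None standing in for a missing intensity. This mirrors what SimpleImputer does per feature but is not the code preprocess() runs; the function name and data are hypothetical.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    if strategy not in ("mean", "median"):
        raise ValueError(f"unsupported strategy: {strategy!r}")
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical intensities of one protein across four samples.
column = [1.0, None, 3.0, 8.0]
print(impute(column, "mean"))
print(impute(column, "median"))
```

The median is less sensitive to a single extreme intensity than the mean, which is why the two results differ here.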

Parameters

remove_contaminations (bool, optional) – remove ProteinGroups that are identified as contamination. Defaults to False.
normalization (str, optional) – method to normalize data: either “zscore”, “quantile”, “linear” or “vst”. Defaults to None.
remove_samples (list, optional) – list with sample ids to remove. Defaults to None.
imputation (str, optional) – method to impute data: either “mean”, “median”, “knn” or “randomforest”. Defaults to None.

preprocess_print_info()

Print summary of preprocessing steps