DataSet

class DataSet(loader, metadata_path=None, sample_column=None)

Analysis Object

ancova(protein_id, covar, between)

Analysis of covariance (ANCOVA) with one or more covariate(s). Wrapper around: https://pingouin-stats.org/generated/pingouin.ancova.html

Parameters
  • protein_id (str) – ProteinID/ProteinGroup - dependent variable

  • covar (str or list) – Name(s) of column(s) in metadata with the covariate.

  • between (str) – Name of column in data with the between factor.

Returns

ANCOVA summary:

  • 'Source': Names of the factor considered

  • 'SS': Sums of squares

  • 'DF': Degrees of freedom

  • 'F': F-values

  • 'p-unc': Uncorrected p-values

  • 'np2': Partial eta-squared

Return type

pandas.DataFrame

anova(column, protein_ids='all', tukey=True)

One-way Analysis of Variance (ANOVA)

Parameters
  • column (str) – A metadata column used to calculate ANOVA

  • protein_ids (str or list, optional) – ProteinIDs to calculate ANOVA for (the dependent variable): either a single ProteinID as a string, several ProteinIDs as a list, or “all” to calculate ANOVA for all ProteinIDs. Defaults to “all”.

  • tukey (bool, optional) – Whether to calculate a Tukey-HSD post-hoc test. Defaults to True.

Returns

  • 'Protein ID': ProteinID/ProteinGroup

  • 'ANOVA_pvalue': p-value of ANOVA

  • 'A vs. B Tukey test': Tukey-HSD corrected p-values (each combination represents a column)

Return type

pandas.DataFrame
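
To make the 'ANOVA_pvalue' column easier to interpret, here is a minimal sketch of the F-statistic that a one-way ANOVA is built on, written in plain Python. It is not the library's implementation; the function name `one_way_anova_f` and the group labels are hypothetical, and turning F into a p-value (which needs the F-distribution) is left out.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical intensities of one protein, grouped by a metadata column
intensities = {
    "control": [1.0, 2.0, 3.0],
    "treated": [2.0, 3.0, 4.0],
    "recovered": [5.0, 6.0, 7.0],
}
f_value, df_between, df_within = one_way_anova_f(list(intensities.values()))
print(f_value, df_between, df_within)
```

A large F means that the group means differ by more than the within-group scatter would suggest.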

calculate_tukey(protein_id, group, df=None)

Calculate pairwise Tukey-HSD post-hoc test. Wrapper around: https://pingouin-stats.org/generated/pingouin.pairwise_tukey.html#pingouin.pairwise_tukey

Parameters
  • protein_id (str) – ProteinID to calculate the pairwise Tukey-HSD post-hoc test for (the dependent variable)

  • group (str) – A metadata column used to calculate the pairwise Tukey test

  • df (pandas.DataFrame, optional) – Defaults to None.

Returns

  • 'A': Name of first measurement

  • 'B': Name of second measurement

  • 'mean(A)': Mean of first measurement

  • 'mean(B)': Mean of second measurement

  • 'diff': Mean difference (= mean(A) - mean(B))

  • 'se': Standard error

  • 'T': T-values

  • 'p-tukey': Tukey-HSD corrected p-values

  • 'hedges': Hedges effect size (or any effect size defined in effsize)

  • 'comparison': combination of measurement

  • 'Protein ID': ProteinID/ProteinGroup

Return type

pandas.DataFrame
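
As a rough illustration of how the descriptive columns of this table relate to each other, the sketch below builds one row per pair of groups in plain Python. It covers only 'A', 'B', 'mean(A)', 'mean(B)', 'diff' and 'comparison'; the standard error, T-values and Tukey-corrected p-values are not computed. The helper name and the "A vs. B" label format are assumptions for illustration.

```python
from itertools import combinations
from statistics import mean


def pairwise_mean_differences(groups):
    """Build one row per unordered pair of groups (descriptive columns only)."""
    rows = []
    for a, b in combinations(sorted(groups), 2):
        mean_a, mean_b = mean(groups[a]), mean(groups[b])
        rows.append(
            {
                "A": a,
                "B": b,
                "mean(A)": mean_a,
                "mean(B)": mean_b,
                "diff": mean_a - mean_b,  # diff = mean(A) - mean(B)
                "comparison": f"{a} vs. {b}",  # assumed label format
            }
        )
    return rows


# Hypothetical intensities of one protein, grouped by a metadata column
groups = {
    "control": [1.0, 2.0, 3.0],
    "treated": [2.0, 3.0, 4.0],
    "recovered": [5.0, 6.0, 7.0],
}
rows = pairwise_mean_differences(groups)
for row in rows:
    print(row)
```

With k groups the table has k(k-1)/2 rows, which is why the post-hoc p-values need a correction such as Tukey-HSD.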

create_matrix()

Creates a matrix from the output file, with columns representing features (proteins) and rows representing samples.
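
The layout described above (samples as rows, proteins as columns) can be sketched in plain Python as follows. This is only an illustration of the orientation, assuming the input is a list of hypothetical (sample, protein, intensity) records; the real method works on the loaded output file.

```python
def to_sample_matrix(records):
    """Pivot (sample, protein, intensity) records into a samples x proteins matrix."""
    samples = sorted({sample for sample, _, _ in records})
    proteins = sorted({protein for _, protein, _ in records})
    lookup = {(sample, protein): value for sample, protein, value in records}
    # Missing measurements stay as None
    matrix = [[lookup.get((s, p)) for p in proteins] for s in samples]
    return samples, proteins, matrix


# Hypothetical long-format measurements
records = [
    ("s1", "P1", 10.0),
    ("s1", "P2", 20.0),
    ("s2", "P1", 30.0),
]
samples, proteins, matrix = to_sample_matrix(records)
print(samples, proteins, matrix)
```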

load_metadata(file_path, sample_column)

Load metadata from an xlsx, csv or txt file

Parameters
  • file_path (str) – path to metadata file

  • sample_column (str) – column name with sample IDs
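
The role of sample_column can be illustrated with a small standard-library sketch that reads a delimited metadata table and indexes it by sample ID. The function `read_metadata` and the column names are hypothetical and only show the idea; they are not part of the DataSet API.

```python
import csv
import io


def read_metadata(handle, sample_column, delimiter=","):
    """Read a delimited metadata table and index its rows by sample ID."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    if sample_column not in (reader.fieldnames or []):
        raise KeyError(f"column {sample_column!r} not found in metadata")
    return {row[sample_column]: row for row in reader}


# Hypothetical metadata with a 'sample' column holding the sample IDs
text = "sample,disease\ns1,healthy\ns2,diabetes\n"
metadata = read_metadata(io.StringIO(text), sample_column="sample")
print(metadata["s2"]["disease"])
```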

overview()

Print overview of the DataSet

perform_diff_expression_analysis(group1, group2, column=None, method='ttest')

Perform differential expression analysis using a t-test or Wald test. A Wald test will fit a generalized linear model.

Parameters
  • column (str) – column name in the metadata file with the two groups to compare

  • group1 (str/list) – name of the first group to compare (must be present in column), or a list of sample names to compare

  • group2 (str/list) – name of the second group to compare (must be present in column), or a list of sample names to compare

  • method (str, optional) – statistical method to calculate differential expression: “ttest” or, for a Wald test, “wald”. Defaults to “ttest”.

Returns

pandas DataFrame with foldchange, foldchange_log2 and pvalue for each ProteinID/ProteinGroup between group1 and group2.

  • 'Protein ID': ProteinID/ProteinGroup

  • 'pval': p-value of the ProteinID/ProteinGroup

  • 'qval': multiple testing - corrected p-value

  • 'log2fc': log2(foldchange)

  • 'grad': the gradient of the log-likelihood

  • 'coef_mle': the maximum-likelihood estimate of coefficient in liker-space

  • 'coef_sd': the standard deviation of the coefficient in liker-space

  • 'll': the log-likelihood of the estimation

Return type

pandas.DataFrame
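
To give a feel for the 'log2fc' column and the statistic behind a t-test, here is a minimal plain-Python sketch for a single protein. It is not the library's implementation: the direction of the ratio (group1 over group2) and the use of Welch's unequal-variance t-statistic are assumptions, and the p-value and multiple-testing correction are omitted.

```python
import math
from statistics import mean, variance


def log2_fold_change(group1, group2):
    """log2 of the ratio of group means (assumed direction: group1 / group2)."""
    return math.log2(mean(group1) / mean(group2))


def welch_t(group1, group2):
    """Welch's t-statistic for two independent samples."""
    se = math.sqrt(variance(group1) / len(group1) + variance(group2) / len(group2))
    return (mean(group1) - mean(group2)) / se


# Hypothetical raw intensities of one protein in two groups
group1 = [8.0, 10.0, 6.0]
group2 = [2.0, 3.0, 1.0]
log2fc = log2_fold_change(group1, group2)
t_value = welch_t(group1, group2)
print(log2fc, t_value)
```

A log2 fold change of 2 corresponds to a four-fold difference between the group means.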

plot_clustermap(**kwargs)

plot_correlation_matrix(method='pearson')

Plot Correlation Matrix

Parameters
  • method (str, optional) – correlation coefficient: “pearson”, “kendall” (Kendall Tau correlation) or “spearman”. Defaults to “pearson”.

Returns

Correlation matrix

Return type

plotly.graph_objects._figure.Figure
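
The difference between the “pearson” and “spearman” options can be seen in a small plain-Python sketch of a sample-by-sample correlation matrix. The helper names and sample data are hypothetical, Kendall's tau is not covered, and the real method returns a plotly figure rather than nested lists.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_matrix(samples, method="pearson"):
    """Correlate every sample with every other sample."""
    corr = pearson if method == "pearson" else spearman
    names = sorted(samples)
    return names, [[corr(samples[a], samples[b]) for b in names] for a in names]


# Hypothetical protein intensities for two samples
samples = {"s1": [1.0, 2.0, 3.0, 4.0], "s2": [1.0, 4.0, 9.0, 16.0]}
names, pearson_matrix = correlation_matrix(samples, method="pearson")
_, spearman_matrix = correlation_matrix(samples, method="spearman")
print(pearson_matrix[0][1], spearman_matrix[0][1])
```

Because s2 is a monotonic but non-linear function of s1, the rank-based Spearman coefficient is 1 while the Pearson coefficient is slightly lower.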

plot_dendrogram(**kwargs)
plot_intensity(protein_id, group=None, subgroups=None, method='box', add_significance=False, log_scale=False)

Plot Intensity of individual Protein/ProteinGroup

Parameters
  • protein_id (str) – ProteinGroup ID

  • group (str, optional) – A metadata column used for grouping. Defaults to None.

  • subgroups (list, optional) – Select variables from the group column. Defaults to None.

  • method (str, optional) – Violinplot = “violin”, Boxplot = “box”, Scatterplot = “scatter”. Defaults to “box”.

  • add_significance (bool, optional) – add p-value bar, only possible when two groups are compared. Default False.

  • log_scale (bool, optional) – yaxis in logarithmic scale. Defaults to False.

Returns

Plotly Plot

Return type

plotly.graph_objects._figure.Figure

plot_pca(group=None, circle=False)

Plot Principal Component Analysis (PCA)

Parameters
  • group (str, optional) – column in metadata that should be used for coloring. Defaults to None.

  • circle (bool, optional) – draw circle around each group. Defaults to False.

Returns

PCA plot

Return type

plotly.graph_objects._figure.Figure

plot_sampledistribution(method='violin', color=None, log_scale=False)

Plot Intensity Distribution for each sample. Either Violin or Boxplot

Parameters
  • method (str, optional) – Violinplot = “violin”, Boxplot = “box”. Defaults to “violin”.

  • color (str, optional) – A metadata column used to color the boxes. Defaults to None.

  • log_scale (bool, optional) – yaxis in logarithmic scale. Defaults to False.

Returns

Plotly Sample Distribution Plot

Return type

plotly.graph_objects._figure.Figure

plot_tsne(group=None, circle=False, perplexity=30, n_iter=1000)

Plot t-distributed stochastic neighbor embedding (t-SNE)

Parameters
  • group (str, optional) – column in metadata that should be used for coloring. Defaults to None.

  • circle (bool, optional) – draw circle around each group. Defaults to False.

Returns

t-SNE plot

Return type

plotly.graph_objects._figure.Figure

plot_umap(group=None, circle=False)

Plot Uniform Manifold Approximation and Projection for Dimension Reduction

Parameters
  • group (str, optional) – column in metadata that should be used for coloring. Defaults to None.

  • circle (bool, optional) – draw circle around each group. Defaults to False.

Returns

UMAP plot

Return type

plotly.graph_objects._figure.Figure

plot_volcano(group1, group2, column=None, method='ttest', labels=False, min_fc=1, alpha=0.05, draw_line=True)

Plot Volcano Plot

Parameters
  • column (str) – column name in the metadata file with the two groups to compare

  • group1 (str/list) – name of the first group to compare (must be present in column), or a list of sample names to compare

  • group2 (str/list) – name of the second group to compare (must be present in column), or a list of sample names to compare

  • method (str) – “anova”, “wald” or “ttest”. Defaults to “ttest”.

  • labels (bool) – Add text labels to significant proteins. Defaults to False.

  • alpha (float, optional) – p-value cut off. Defaults to 0.05.

  • min_fc (float) – Minimum fold change. Defaults to 1.

  • draw_line (bool) – whether to draw cut off lines. Defaults to True.

Returns

Volcano Plot

Return type

plotly.graph_objects._figure.Figure
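
How alpha and min_fc could partition the points of a volcano plot is sketched below in plain Python. This is an illustration under stated assumptions, not the library's code: it assumes min_fc is compared against the absolute log2 fold change and alpha against the p-value, and the labels "up", "down" and "non_sig" are hypothetical.

```python
import math


def classify(log2fc, pval, min_fc=1, alpha=0.05):
    """Label a protein by fold change and p-value cut offs (assumed semantics)."""
    if pval < alpha and log2fc >= min_fc:
        return "up"
    if pval < alpha and log2fc <= -min_fc:
        return "down"
    return "non_sig"


def volcano_point(log2fc, pval):
    """Coordinates on a volcano plot: x = log2 fold change, y = -log10(p-value)."""
    return log2fc, -math.log10(pval)


# Hypothetical (log2fc, p-value) results for four proteins
results = {
    "P1": (2.0, 0.001),
    "P2": (-1.5, 0.01),
    "P3": (2.0, 0.2),
    "P4": (0.5, 0.001),
}
labels = {pid: classify(fc, p) for pid, (fc, p) in results.items()}
points = {pid: volcano_point(fc, p) for pid, (fc, p) in results.items()}
print(labels)
```

Under these assumptions, the cut off lines drawn with draw_line=True would sit at x = ±min_fc and y = -log10(alpha).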

preprocess(remove_contaminations=False, subset=False, normalization=None, imputation=None, remove_samples=None)

Preprocess Protein data

Removal of contaminations:

Removes all observations that were identified as contaminations.

Normalization:

“zscore”, “quantile”, “linear”, “vst”

Normalize data using zscore, quantile, linear (using l2 norm) or vst normalization.

Z-score normalization equals standardization using StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Variance stabilization transformation uses: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html

For more information, visit sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html
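
As a minimal sketch of what the “zscore” and “linear” options compute, the plain-Python functions below standardize a vector of intensities and scale a vector to unit l2 norm. They mirror the arithmetic of StandardScaler (population standard deviation) and of l2 normalization, but the function names are hypothetical and the axis along which the library applies them is not shown here.

```python
import math
from statistics import mean, pstdev


def zscore(values):
    """Standardize to mean 0 and unit variance (population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def l2_normalize(values):
    """Scale a vector so that its Euclidean (l2) norm equals 1."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]


# Hypothetical intensity vectors
standardized = zscore([2.0, 4.0, 6.0])
unit_vector = l2_normalize([3.0, 4.0])
print(standardized, unit_vector)
```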

Imputation:

“mean”, “median”, “knn” or “randomforest”. For more information visit:

SimpleImputer: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

k-Nearest Neighbors Imputation: https://scikit-learn.org/stable/modules/impute.html#impute

Random Forest Imputation: https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor
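
The two simplest strategies, “mean” and “median”, replace each missing value with a summary of the observed values of the same feature. A plain-Python sketch, assuming missing values are encoded as None and using a hypothetical helper name:

```python
from statistics import mean, median


def impute_column(values, strategy="mean"):
    """Fill missing entries (None) with the mean or median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical intensities of one protein across samples, with gaps
mean_filled = impute_column([1.0, None, 3.0], strategy="mean")
median_filled = impute_column([1.0, None, 2.0, 10.0], strategy="median")
print(mean_filled, median_filled)
```

The median is less sensitive to outliers such as the 10.0 in the second example; “knn” and “randomforest” instead predict the missing value from other features.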

Parameters
  • remove_contaminations (bool, optional) – remove ProteinGroups that are identified as contamination. Defaults to False.

  • normalization (str, optional) – method to normalize data: either “zscore”, “quantile”, “linear” or “vst”. Defaults to None.

  • remove_samples (list, optional) – list with sample ids to remove. Defaults to None.

  • imputation (str, optional) – method to impute data: either “mean”, “median”, “knn” or “randomforest”. Defaults to None.

preprocess_print_info()

Print summary of preprocessing steps