Version
Publication date

Sep 19, 2024

Confidentiality
Public

Fermentalg Data Quality Check

TASK
Typing name : TASK.gws_plate_reader.FermentalgQualityCheck
Brick : gws_plate_reader

Apply quality checks and filters to fermentation data (outliers, ranges, missing values)

[Generated by Task Expert Agent]

Apply configurable quality checks to fermentation time series data.

Overview

This task validates and filters fermentation ResourceSets. Quality checks run on the raw data ResourceSet; both the raw and the interpolated data are then filtered according to which samples pass. It is designed to work with output from FermentalgLoadData or FilterFermentorAnalyseLoadedResourceSetBySelection.

Purpose

  • Outlier Detection: Identify and optionally remove statistical outliers from raw data
  • Range Validation: Ensure values fall within acceptable biological ranges
  • Missing Data: Filter samples based on data completeness
  • Column Validation: Verify required columns exist with valid data
  • Synchronized Filtering: Apply quality checks to raw data and automatically filter interpolated data
  • Quality Tags: Mark samples with quality issues for downstream analysis

Quality Check Types

1. Outlier Detection

Detect outliers using statistical methods:

Z-Score Method

  • Description: Detects values outside mean ± k*std
  • Formula: z = (x - μ) / σ; a value is flagged when |z| > k
  • Threshold: Default k=3 (about 99.7% of a normal distribution lies within ±3σ)
  • Best For: Normally distributed data
  • Use Cases: Growth rates, substrate consumption
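
The z-score rule above can be sketched as follows. This is a minimal illustration using pandas; the helper name `zscore_outliers` and the sample values are assumptions made for the example, not part of the gws_plate_reader API.

```python
import pandas as pd


def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask that is True where |z| exceeds the threshold."""
    std = values.std(ddof=0)  # population standard deviation
    if pd.isna(std) or std == 0:
        # Empty or constant series: nothing can be flagged
        return pd.Series(False, index=values.index)
    z = (values - values.mean()) / std
    return z.abs() > threshold


# Illustrative biomass readings with one obvious spike at the end
biomass = pd.Series([1.0, 1.1, 0.9, 1.0, 1.2, 1.0, 0.95, 1.05, 1.1, 0.9, 1.0, 25.0])
mask = zscore_outliers(biomass, threshold=3.0)
print(biomass[mask])
```

With the population standard deviation, |z| can never exceed sqrt(n - 1), so a threshold of 3 cannot flag anything in a series of 10 points or fewer. Short time series may need a lower threshold or a different method.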

IQR (Interquartile Range) Method

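  • Description: Detects values outside the interval [Q1 - k×IQR, Q3 + k×IQR]
  • Formula: IQR = Q3 - Q1
  • Threshold: k=1.5 is Tukey's rule; because the shared outlier_threshold parameter defaults to 3.0, set it to 1.5 explicitly to apply that rule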
  • Best For: Skewed distributions, robust to extreme values
  • Use Cases: Environmental parameters, batch-to-batch variation
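
A minimal sketch of the IQR rule, assuming pandas and its default linear quantile interpolation. The helper `iqr_outliers` is hypothetical, and the task itself may compute quartiles differently.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, multiplier: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    return (values < lower) | (values > upper)


# Illustrative dissolved-oxygen readings with one sensor glitch
do2 = pd.Series([40.0, 42.0, 41.0, 39.0, 43.0, 40.5, 95.0])
mask = iqr_outliers(do2, multiplier=1.5)
print(do2[mask])
```

Because quartiles barely move when a single extreme value is added, the fences stay close to the bulk of the data, which is why this method is described as robust.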

Percentile Method

  • Description: Removes values in extreme percentiles
  • Formula: Keep values between pLow and pHigh
  • Threshold: Default 1%-99%
  • Best For: Known expected ranges
  • Use Cases: Removing initialization/shutdown periods
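
The percentile cutoff can be sketched in the same way. The helper `percentile_outliers` is an assumption for illustration only.

```python
import pandas as pd


def percentile_outliers(values: pd.Series, low: float = 1.0, high: float = 99.0) -> pd.Series:
    """Flag values below the `low` percentile or above the `high` percentile."""
    lower = values.quantile(low / 100.0)
    upper = values.quantile(high / 100.0)
    return (values < lower) | (values > upper)


# Evenly spaced illustrative signal from 0 to 100
signal = pd.Series(range(101), dtype=float)
mask = percentile_outliers(signal, low=1.0, high=99.0)
print(signal[mask])
```

Unlike the z-score and IQR rules, this method flags the tails of every series by construction, even when the data contain no anomalies, so it is best kept for cases where trimming a fixed fraction is intended.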

2. Value Range Validation

Validate data against expected biological ranges:

  • Min Threshold: Values below this are invalid
  • Max Threshold: Values above this are invalid
  • Action: Mark samples or remove invalid data points

Example Ranges:

  • Temperature: 15-45°C
  • pH: 4.0-9.0
  • Dissolved Oxygen: 0-100%
  • Biomass: 0-50 g/L
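
As a rough sketch of a range check, assuming a pandas DataFrame per sample. The helper `range_violations` is hypothetical; a bound of None means no limit on that side.

```python
import pandas as pd


def range_violations(df: pd.DataFrame, column: str,
                     min_value=None, max_value=None) -> pd.Series:
    """Return a boolean mask of rows whose value lies outside the range."""
    values = df[column]
    mask = pd.Series(False, index=df.index)
    if min_value is not None:
        mask |= values < min_value
    if max_value is not None:
        mask |= values > max_value
    return mask


df = pd.DataFrame({"pH": [6.8, 7.0, 3.2, 7.1, 9.5]})
bad = range_violations(df, "pH", min_value=4.0, max_value=9.0)
cleaned = df[~bad]  # equivalent to a "remove rows" action
print(cleaned)
```

NaN values compare as False on both sides, so in this sketch missing values are never counted as range violations.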

3. Missing Data Checks

Filter based on data completeness:

  • Max Missing Percentage: Reject samples with too many NaN values
  • Required Columns: Ensure critical columns exist and have data
  • Gap Detection: Identify time series with large temporal gaps
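
The completeness checks might look like the sketch below. Both helpers are assumptions: the source does not state whether the missing percentage is computed over all cells or per column, and gap detection has no documented parameter, so the definitions here are illustrative only.

```python
import numpy as np
import pandas as pd


def missing_percentage(df: pd.DataFrame) -> float:
    """Percentage of NaN cells over the whole table (assumed definition)."""
    if df.size == 0:
        return 100.0
    return float(df.isna().sum().sum()) / df.size * 100.0


def largest_time_gap(df: pd.DataFrame, time_column: str) -> float:
    """Largest step between consecutive time points, in the column's units."""
    times = df[time_column].dropna().sort_values()
    if len(times) < 2:
        return 0.0
    return float(times.diff().max())


df = pd.DataFrame({
    "Temps de culture (h)": [0.0, 1.0, 2.0, 8.0],
    "Biomasse (g/L)": [0.1, np.nan, 0.4, 1.2],
})
pct = missing_percentage(df)  # 1 NaN out of 8 cells
gap = largest_time_gap(df, "Temps de culture (h)")
print(pct, gap)
```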

4. Custom Filters

Apply flexible pandas-style filters:

  • Expression-Based: e.g., "column > value"
  • Aggregate Functions: mean, std, min, max over entire timeseries
  • Cross-Column: Validate relationships (e.g., substrate + product balance)
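
The three filter styles can be illustrated with plain pandas. This is only a sketch of the idea: the column names and limits are invented, and the source does not document how custom filters are passed to the task.

```python
import pandas as pd

df = pd.DataFrame({
    "time_h": [0.0, 1.0, 2.0, 3.0],
    "glucose": [20.0, 15.0, 8.0, 2.0],
    "ethanol": [0.0, 2.0, 5.0, 8.0],
})

# Expression-based: keep only rows matching a "column > value" expression
kept = df.query("glucose > 5")

# Aggregate function: one decision for the entire time series
mean_ok = bool(df["glucose"].mean() < 15)

# Cross-column: substrate plus product must not exceed the initial substrate
balance_ok = bool((df["glucose"] + df["ethanol"] <= 20.0).all())

print(len(kept), mean_ok, balance_ok)
```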

Configuration Parameters

Outlier Detection Parameters

outlier_method (String)

  • Default: "none"
  • Options: "none", "zscore", "iqr", "percentile"
  • Description: Statistical method for outlier detection
  • Recommendation: Use "iqr" for biological data (robust to extremes)

outlier_threshold (Float)

  • Default: 3.0
  • Range: 1.0 to 10.0
  • For Z-Score: Number of standard deviations (3.0 covers about 99.7% of a normal distribution)
  • For IQR: Multiplier for interquartile range (1.5 = Tukey's rule; set it explicitly, since the default is 3.0)
  • Impact: Lower = stricter (more outliers detected)

outlier_percentile_low (Float)

  • Default: 1.0
  • Range: 0.0 to 50.0
  • Only For: percentile method
  • Description: Lower percentile cutoff (values below are outliers)

outlier_percentile_high (Float)

  • Default: 99.0
  • Range: 50.0 to 100.0
  • Only For: percentile method
  • Description: Upper percentile cutoff (values above are outliers)

outlier_columns (List[String])

  • Default: [] (all numeric columns)
  • Description: Columns to check for outliers (empty = all); entered as comma-separated column names in the task configuration
  • Example: ["Biomasse (g/L)", "DO2 (%)"]

outlier_action (String)

  • Default: "remove_rows"
  • Options: "remove_rows", "mark_only", "remove_sample"
  • Description: Action when outliers detected
    • "remove_rows": Delete rows with outliers
    • "mark_only": Add quality_warning tag but keep data
    • "remove_sample": Exclude entire sample from output

Range Validation Parameters

range_checks (List[ParamSet])

Define acceptable value ranges for columns:

[
    {
        'column': 'Temperature (°C)',
        'min_value': 15.0,
        'max_value': 45.0,
        'action': 'remove_rows'
    },
    {
        'column': 'pH',
        'min_value': 4.0,
        'max_value': 9.0,
        'action': 'mark_only'
    }
]
  • column: Column name to validate
  • min_value: Minimum acceptable value (None = no minimum)
  • max_value: Maximum acceptable value (None = no maximum)
  • action: "remove_rows", "mark_only", or "remove_sample"

Missing Data Parameters

max_missing_percentage (Float)

  • Default: 50.0
  • Range: 0.0 to 100.0
  • Description: Maximum percentage of NaN values allowed per sample
  • Action: Samples exceeding this are excluded from output

required_columns (List[String])

  • Default: []
  • Description: Columns that must exist with ≥ 1 non-NaN value; entered as a comma-separated list in the task configuration
  • Example: ["Temps de culture (h)", "Biomasse (g/L)"]
  • Action: Samples missing these columns are excluded
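
A sketch of the required-column rule, assuming one pandas DataFrame per sample. The helper `missing_required` is illustrative.

```python
import numpy as np
import pandas as pd


def missing_required(df: pd.DataFrame, required: list) -> list:
    """List required columns that are absent or contain only NaN."""
    return [
        column for column in required
        if column not in df.columns or df[column].notna().sum() == 0
    ]


df = pd.DataFrame({
    "Temps de culture (h)": [0.0, 1.0],
    "Biomasse (g/L)": [np.nan, np.nan],
})
problems = missing_required(
    df, ["Temps de culture (h)", "Biomasse (g/L)", "Glucose (g/L)"]
)
print(problems)
```

Here the sample would be excluded for two reasons: one column holds no data and another does not exist.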

Data Point Count Parameters

min_data_points (List[ParamSet])

Define minimum number of non-NaN measurements required for specific columns:

[
    {
        'column': 'Biomasse (g/L)',
        'min_count': 3,
        'action': 'remove_sample'
    },
    {
        'column': 'DO2 (%)',
        'min_count': 5,
        'action': 'mark_only'
    }
]
  • column: Column name to check
  • min_count: Minimum number of non-NaN values required
  • action: "mark_only" or "remove_sample"
  • Use Case: Ensure sufficient data points for analysis (e.g., growth curve fitting needs ≥3 points)
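
The rule list above could be evaluated along these lines. This is a sketch under assumptions: `check_min_data_points` is a hypothetical helper, and treating an absent column as having zero data points is a choice made for the example.

```python
import numpy as np
import pandas as pd


def check_min_data_points(df: pd.DataFrame, rules: list):
    """Return (keep_sample, warnings) for one sample."""
    keep, warnings = True, []
    for rule in rules:
        column = rule["column"]
        # An absent column counts as zero data points (assumption)
        count = int(df[column].notna().sum()) if column in df.columns else 0
        if count < rule["min_count"]:
            warnings.append(f"{column}: {count} < {rule['min_count']} data points")
            if rule.get("action", "remove_sample") == "remove_sample":
                keep = False
    return keep, warnings


df = pd.DataFrame({
    "Biomasse (g/L)": [0.1, np.nan, 0.4, 1.2],
    "DO2 (%)": [80.0, 75.0, np.nan, np.nan],
})
rules = [
    {"column": "Biomasse (g/L)", "min_count": 3, "action": "remove_sample"},
    {"column": "DO2 (%)", "min_count": 5, "action": "mark_only"},
]
keep, warnings = check_min_data_points(df, rules)
print(keep, warnings)
```

The sample is kept because only the mark_only rule is violated, but the warning is still reported.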

Additional Parameters

add_quality_tags (Boolean)

  • Default: True
  • Description: Add quality check result tags to output Tables
  • Tags Added:
    • quality_check_passed: "true" or "false"
    • quality_warnings: Description of issues found
    • outliers_detected: Count of outliers found
    • missing_data_percentage: Percentage of missing values
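
The options and ranges documented in this section can be checked before a run. The validator below is a hypothetical convenience, not something the brick provides; it only encodes the defaults, allowed values, and ranges listed above.

```python
ALLOWED_METHODS = {"none", "zscore", "iqr", "percentile"}
ALLOWED_ACTIONS = {"remove_rows", "mark_only", "remove_sample"}

# name: (default, minimum, maximum) as documented above
NUMERIC_BOUNDS = {
    "outlier_threshold": (3.0, 1.0, 10.0),
    "outlier_percentile_low": (1.0, 0.0, 50.0),
    "outlier_percentile_high": (99.0, 50.0, 100.0),
    "max_missing_percentage": (50.0, 0.0, 100.0),
}


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems (empty when the config is valid)."""
    errors = []
    if config.get("outlier_method", "none") not in ALLOWED_METHODS:
        errors.append("outlier_method must be one of: " + ", ".join(sorted(ALLOWED_METHODS)))
    if config.get("outlier_action", "remove_rows") not in ALLOWED_ACTIONS:
        errors.append("outlier_action must be one of: " + ", ".join(sorted(ALLOWED_ACTIONS)))
    for name, (default, low, high) in NUMERIC_BOUNDS.items():
        value = config.get(name, default)
        if not low <= value <= high:
            errors.append(f"{name}={value} is outside [{low}, {high}]")
    return errors


ok = validate_config({"outlier_method": "iqr", "outlier_threshold": 1.5})
bad = validate_config({"outlier_method": "mad", "outlier_threshold": 0.5})
print(ok, bad)
```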

Input Requirements

data (ResourceSet)

  • Source: FermentalgLoadData or Filter task output (raw data)
  • Requirements:
    • Must contain Table resources
    • Tables must have numeric data columns
    • Recommended: Temps de culture (h) column for time series
  • Tags Used:
    • batch, sample, medium (preserved in output)
  • Purpose: Quality checks are performed on this ResourceSet

interpolated_data (ResourceSet)

  • Source: Interpolation task output (interpolated time series data)
  • Requirements:
    • Must contain Table resources matching data ResourceSet
    • Resource names should match those in data ResourceSet
  • Tags Used:
    • batch, sample, medium (preserved in output)
  • Purpose: Filtered based on quality checks from data ResourceSet

Output Structure

filtered_data (ResourceSet)

Contains raw data Tables that passed all quality checks:

Data

  • Rows: May be reduced if outliers/invalid values removed
  • Columns: Same as input
  • Values: Outliers and out-of-range values removed (if configured)

Tags (Preserved + New)

  • Original: batch, sample, medium, missing_value
  • Quality Tags (if add_quality_tags=True):
    • quality_check_passed: Overall pass/fail
    • quality_warnings: Comma-separated list of warnings
    • outliers_detected: Number of outlier points found
    • missing_data_percentage: % of NaN values
    • range_violations: Number of range violations
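
For downstream filtering it can help to picture the quality tags as a flat mapping of strings. The sketch below assumes string-valued tags and a particular number format; the exact formatting used by the task is not stated in the source.

```python
def build_quality_tags(passed: bool, warnings: list, outliers: int,
                       missing_pct: float, range_violations: int) -> dict:
    """Assemble the quality tags listed above as a dict of strings."""
    return {
        "quality_check_passed": "true" if passed else "false",
        "quality_warnings": ", ".join(warnings),
        "outliers_detected": str(outliers),
        "missing_data_percentage": f"{missing_pct:.1f}",
        "range_violations": str(range_violations),
    }


tags = build_quality_tags(
    passed=False,
    warnings=["pH out of range", "2 outlier(s) in DO2 (%)"],
    outliers=2,
    missing_pct=12.5,
    range_violations=1,
)
print(tags)
```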

filtered_interpolated_data (ResourceSet)

Contains interpolated data Tables for samples that passed quality checks:

Data

  • Rows: Same as input interpolated data (not modified)
  • Columns: Same as input
  • Values: Unchanged (filtering is sample-level, not value-level)

Tags (Preserved + New)

  • Original: batch, sample, medium
  • Quality Tags (if add_quality_tags=True):
    • quality_check_passed: "true" (only passing samples included)

Excluded Samples

  • Samples failing checks are logged but not included in output
  • Check log for list of excluded samples and reasons

Processing Logic

Execution Flow

  1. Input Validation: Check both ResourceSets contain Tables
  2. Selection Filtering:
     a. Extract all (batch, sample) pairs from the interpolated_data ResourceSet
     b. Filter the data ResourceSet to only process samples matching these pairs
     c. This ensures only selected samples are quality-checked
  3. Quality Check on Raw Data (per-sample):
     a. Extract Table and data from the filtered data ResourceSet
     b. Check missing data percentage
     c. Validate required columns exist
     d. Check minimum data points per column (e.g., Biomasse needs ≥3 points)
     e. Apply range checks (per column)
     f. Detect outliers (per column)
     g. Take action (remove rows, mark, or exclude sample)
     h. Calculate quality metrics
     i. Add quality tags if enabled
     j. Track which samples passed
  4. Filter Interpolated Data:
     a. For each sample in the interpolated_data ResourceSet
     b. Include only if the corresponding sample in data passed checks
     c. Copy all tags and add quality_check_passed=true
  5. Output Assembly: Create filtered versions of both ResourceSets
  6. Summary Logging: Report statistics on checks performed
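
The selection and synchronized filtering steps can be sketched with plain dictionaries keyed by (batch, sample). This is a simplified model: real ResourceSets, tags, and row-level removal are not represented, and `synchronized_filter` and `at_most_half_missing` are hypothetical names.

```python
from typing import Callable, Dict, Tuple

import pandas as pd

Key = Tuple[str, str]  # (batch, sample)


def synchronized_filter(
    raw: Dict[Key, pd.DataFrame],
    interpolated: Dict[Key, pd.DataFrame],
    passes: Callable[[pd.DataFrame], bool],
):
    # Step 2: only quality-check raw samples that were selected upstream
    selected = {key: table for key, table in raw.items() if key in interpolated}
    # Step 3: run the checks on raw data and record which samples pass
    passed = {key for key, table in selected.items() if passes(table)}
    # Steps 4-5: keep the same samples in both outputs
    filtered_raw = {key: selected[key] for key in selected if key in passed}
    filtered_interpolated = {
        key: table for key, table in interpolated.items() if key in passed
    }
    return filtered_raw, filtered_interpolated


def at_most_half_missing(table: pd.DataFrame) -> bool:
    return bool(table.isna().mean().mean() <= 0.5)


raw = {
    ("B1", "S1"): pd.DataFrame({"x": [1.0, 2.0, 3.0]}),
    ("B1", "S2"): pd.DataFrame({"x": [1.0, None, None]}),  # too many NaN
    ("B2", "S1"): pd.DataFrame({"x": [1.0, 2.0, 3.0]}),    # not selected
}
interpolated = {
    ("B1", "S1"): pd.DataFrame({"x": [1.0, 1.5, 2.0, 2.5, 3.0]}),
    ("B1", "S2"): pd.DataFrame({"x": [1.0, 1.0, 1.0, 1.0, 1.0]}),
}
filtered_raw, filtered_interpolated = synchronized_filter(
    raw, interpolated, at_most_half_missing
)
print(list(filtered_raw), list(filtered_interpolated))
```

Note that ("B1", "S2") is dropped from the interpolated output even though its interpolated table has no gaps: the decision is always taken on the raw data.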

Decision Rules

  • Sample Pre-Filtered If:
    • (batch, sample) pair not present in interpolated_data ResourceSet
    • This ensures only selected samples are quality-checked
  • Sample Excluded If (from both outputs):
    • Missing data > max_missing_percentage (in raw data)
    • Missing any required_columns (in raw data)
    • Column has fewer data points than min_count (when action="remove_sample")
    • Any check with action="remove_sample" triggered (in raw data)
  • Rows Removed From Raw Data If:
    • Contains outlier (when action="remove_rows")
    • Value outside range (when action="remove_rows")
  • Interpolated Data:
    • Sample-level filtering only (no row removal)
    • Kept unchanged if corresponding raw data sample passed
  • Sample Marked If:
    • Issues detected but action="mark_only"
    • Warnings added to quality_warnings tag

Use Cases

1. Pre-Processing Pipeline

FermentalgLoadData
  ↓
Filter (select samples)
  ↓
Interpolation
  ↓
QualityCheck (remove outliers, validate ranges on raw data, filter interpolated)
  ↓
Analysis (use filtered_interpolated_data)

2. Outlier Removal for Publication

QualityCheck (
    outlier_method=iqr,
    outlier_threshold=1.5,
    outlier_action=remove_rows
)
  → Clean dataset for figures

3. Data Quality Report

QualityCheck (
    outlier_action=mark_only,
    add_quality_tags=True
)
  → Review quality_warnings tags
  → Decide which samples to exclude manually

4. Biological Range Validation

QualityCheck (
    range_checks=[
        {column: "Temperature", min_value: 20, max_value: 40},
        {column: "pH", min_value: 5, max_value: 8},
        {column: "DO2", min_value: 0, max_value: 100}
    ]
)
  → Ensure sensor data is valid

Example Configurations

Conservative Outlier Removal (IQR)

{
    'outlier_method': 'iqr',
    'outlier_threshold': 1.5,
    'outlier_action': 'remove_rows',
    'add_quality_tags': True
}

Strict Quality Gate

{
    'outlier_method': 'zscore',
    'outlier_threshold': 2.5,
    'max_missing_percentage': 10.0,
    'required_columns': [
        'Temps de culture (h)',
        'Biomasse (g/L)',
        'Glucose (g/L)'
    ],
    'range_checks': [
        {'column': 'pH', 'min_value': 5.0, 'max_value': 8.5, 'action': 'remove_sample'}
    ]
}

Quality Audit (Mark Only)

{
    'outlier_method': 'percentile',
    'outlier_percentile_low': 2.0,
    'outlier_percentile_high': 98.0,
    'outlier_action': 'mark_only',
    'add_quality_tags': True
}

Minimum Data Points for Analysis

{
    'min_data_points': [
        {'column': 'Biomasse (g/L)', 'min_count': 3, 'action': 'remove_sample'},
        {'column': 'Glucose (g/L)', 'min_count': 3, 'action': 'remove_sample'},
        {'column': 'DO2 (%)', 'min_count': 5, 'action': 'mark_only'}
    ],
    'add_quality_tags': True
}

Use Case: Ensure sufficient measurements for growth curve fitting (needs ≥3 points)

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| All samples removed | Thresholds too strict | Relax thresholds, use mark_only first |
| No outliers detected | Threshold too high | Lower outlier_threshold |
| Wrong columns checked | Column names mismatch | Verify exact column names (case-sensitive) |
| Too many rows removed | Action on wrong columns | Use outlier_columns to specify target columns |

Best Practices

  1. Start Lenient: Use mark_only first to see quality distribution
  2. Review Distributions: Check data before setting range thresholds
  3. Use IQR for Biology: Biological data often isn't normal
  4. Log Review: Always check log for excluded samples and reasons
  5. Preserve Originals: Keep unfiltered data for comparison
  6. Document Settings: Record quality check parameters with results

Scientific Considerations

Outlier Detection Sensitivity

  • Z-Score: Assumes normal distribution (rare in biology)
  • IQR: Robust, works with skewed data (recommended)
  • Percentile: Good for known distributions

Common Fermentation Ranges

  • Temperature: 15-45°C (organism dependent)
  • pH: 4-9 (process dependent)
  • DO2: 0-100% saturation
  • Biomass: 0-50 g/L (typical range)
  • Substrates/Products: 0-200 g/L (depends on strain)

When to Remove vs Mark

  • Remove: Clear sensor errors, initialization artifacts
  • Mark: Biological variation, borderline outliers
  • Remove Sample: Systemic failure, contamination

Notes

  • All quality checks are optional (can disable all for pass-through)
  • Original ResourceSet is never modified
  • Quality tags enable downstream filtering/analysis
  • Compatible with all Fermentalg workflow tasks
  • Processing is per-sample (samples don't affect each other)

Input

Input Data ResourceSet to check
ResourceSet containing Fermentalg time series data to validate
Input Interpolated data ResourceSet to check
ResourceSet containing Fermentalg interpolated time series data

Output

Quality-checked ResourceSet
ResourceSet containing only samples that passed quality checks
Quality-checked Interpolated ResourceSet
ResourceSet containing only samples that passed quality checks for interpolated data

Configuration

outlier_method

Optional

Method: none, zscore, iqr, percentile

Type : string
Allowed values : none, zscore, iqr, percentile
Default value : none

outlier_threshold

Optional

Threshold for zscore (std) or iqr (multiplier) methods

Type : float
Default value : 3

outlier_percentile_low

Optional

Lower percentile for outlier detection (percentile method)

Type : float
Default value : 1

outlier_percentile_high

Optional

Upper percentile for outlier detection (percentile method)

Type : float
Default value : 99

outlier_columns

Optional

Comma-separated column names (empty = all numeric columns)

Type : string

outlier_action

Optional

Action: remove_rows (delete outlier points), mark_only (tag), remove_sample (exclude entire sample)

Type : string
Allowed values : remove_rows, mark_only, remove_sample
Default value : remove_rows

range_checks

Optional

List of column range validations to apply

Type : List
Maximum occurrences number : -1

column

Column to validate

Type : string

min_value

Optional

Minimum acceptable value (None = no limit)

Type : float

max_value

Optional

Maximum acceptable value (None = no limit)

Type : float

action

Optional

Action: remove_rows, mark_only, remove_sample

Type : string
Allowed values : remove_rows, mark_only, remove_sample
Default value : remove_rows

max_missing_percentage

Optional

Maximum % of NaN values allowed per sample (0-100)

Type : float
Default value : 50

required_columns

Optional

Comma-separated list of columns that must exist with data

Type : string

min_data_points

Optional

List of minimum data point count validations per column

Type : List
Maximum occurrences number : -1

column

Column to check for minimum data points

Type : string

min_count

Optional

Minimum number of non-NaN values required

Type : float
Default value : 3

action

Optional

Action: mark_only or remove_sample

Type : string
Allowed values : mark_only, remove_sample
Default value : remove_sample

add_quality_tags

Optional

Add quality check result tags to output Tables

Type : bool
Default value : true