Apply quality checks and filters to fermentation data (outliers, ranges, missing values)
[Generated by Task Expert Agent]
Apply configurable quality checks to fermentation time series data.
Overview
This task performs data quality validation and filtering on fermentation ResourceSets.
It performs quality checks on the raw data ResourceSet and filters both raw and interpolated
data based on which samples pass the quality checks. Designed to work with output from
FermentalgLoadData or FilterFermentorAnalyseLoadedResourceSetBySelection.
Purpose
- Outlier Detection: Identify and optionally remove statistical outliers from raw data
- Range Validation: Ensure values fall within acceptable biological ranges
- Missing Data: Filter samples based on data completeness
- Column Validation: Verify required columns exist with valid data
- Synchronized Filtering: Apply quality checks to raw data and automatically filter interpolated data
- Quality Tags: Mark samples with quality issues for downstream analysis
Quality Check Types
1. Outlier Detection
Detect outliers using statistical methods:
Z-Score Method
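- Description: Flags values that lie more than k standard deviations from the mean (outside μ ± kσ)
- Formula: |z| = |x - μ| / σ; a value is an outlier when |z| > k
- Threshold: Default k = 3 (about 99.7% of a normal distribution lies within ±3σ)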
- Best For: Normally distributed data
- Use Cases: Growth rates, substrate consumption
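The Z-score rule above can be sketched in a few lines of pandas. This is a minimal illustration, not the task's actual implementation: the function name `zscore_outliers`, the use of the population standard deviation, and the sample values are all assumptions.

```python
import pandas as pd

def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask that is True where |z| exceeds the threshold."""
    mean = values.mean()        # NaN values are skipped by default
    std = values.std(ddof=0)    # population standard deviation (an assumption)
    if not std > 0:             # constant or empty column: nothing to flag
        return pd.Series(False, index=values.index)
    z = (values - mean).abs() / std
    return z > threshold        # NaN compares as False, so gaps are not flagged

# Hypothetical biomass readings with one obvious spike at the end
biomass = pd.Series([1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 1.05, 9.0])
mask = zscore_outliers(biomass, threshold=3.0)
```

Note that a single extreme value inflates the standard deviation it is measured against, so in short series the Z-score of even a large spike stays bounded. This is one reason the IQR method is often preferred for small samples.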
IQR (Interquartile Range) Method
- Description: Flags values outside the interval [Q1 - 1.5×IQR, Q3 + 1.5×IQR]
- Formula: IQR = Q3 - Q1
- Threshold: Default multiplier=1.5 (Tukey's rule)
- Best For: Skewed distributions, robust to extreme values
- Use Cases: Environmental parameters, batch-to-batch variation
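As a rough sketch of the IQR rule, the fences can be computed from the quartiles of a column. The function name `iqr_outliers` and the dissolved-oxygen values are hypothetical, and pandas' default linear quantile interpolation is assumed.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, multiplier: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    # NaN compares as False on both sides, so missing values are not flagged
    return (values < lower) | (values > upper)

# Hypothetical dissolved-oxygen readings with one sensor spike
do2 = pd.Series([40.0, 42.0, 41.0, 43.0, 39.0, 41.5, 40.5, 95.0])
mask = iqr_outliers(do2)
```

Because the quartiles barely move when one value is extreme, the fences stay tight around the bulk of the data, unlike the mean and standard deviation used by the Z-score method.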
Percentile Method
- Description: Removes values in extreme percentiles
- Formula: Keep values between the lower percentile (pLow, outlier_percentile_low) and the upper percentile (pHigh, outlier_percentile_high)
- Threshold: Default 1%-99%
- Best For: Known expected ranges
- Use Cases: Removing initialization/shutdown periods
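The percentile method can be sketched the same way. The helper name `percentile_outliers` and the data are illustrative assumptions; the `low` and `high` arguments are expressed in percent, mirroring the defaults described above.

```python
import pandas as pd

def percentile_outliers(values: pd.Series, low: float = 1.0, high: float = 99.0) -> pd.Series:
    """Return a boolean mask that is True below the low or above the high percentile."""
    lower = values.quantile(low / 100.0)
    upper = values.quantile(high / 100.0)
    return (values < lower) | (values > upper)

# Hypothetical signal with a start-up artifact (-50) and a shutdown artifact (100)
signal = pd.Series([-50.0, 10.0, 11.0, 12.0, 13.0, 14.0,
                    15.0, 16.0, 17.0, 18.0, 19.0, 100.0])
mask = percentile_outliers(signal)
```

Unlike the Z-score and IQR methods, this rule flags a tail of the data whenever the series contains values beyond the cutoffs, whether or not they are anomalous, so it suits cases where a fixed fraction of extreme values is expected.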
2. Value Range Validation
Validate data against expected biological ranges:
- Min Threshold: Values below this are invalid
- Max Threshold: Values above this are invalid
- Action: Mark samples or remove invalid data points
Example Ranges:
- Temperature: 15-45°C
- pH: 4.0-9.0
- Dissolved Oxygen: 0-100%
- Biomass: 0-50 g/L
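A range check of this kind reduces to a boolean mask per column. The sketch below assumes a check is described by a dict with `column`, `min_value`, and `max_value` keys, in the style of the range_checks parameter documented later; the function name `range_violations` is hypothetical.

```python
import pandas as pd

def range_violations(table: pd.DataFrame, check: dict) -> pd.Series:
    """Return a boolean row mask that is True where the value is out of range."""
    column = table[check["column"]]
    bad = pd.Series(False, index=table.index)
    if check.get("min_value") is not None:      # None means no lower bound
        bad = bad | (column < check["min_value"])
    if check.get("max_value") is not None:      # None means no upper bound
        bad = bad | (column > check["max_value"])
    return bad

table = pd.DataFrame({"pH": [3.5, 6.8, 7.0, 9.4]})
both_bounds = range_violations(table, {"column": "pH", "min_value": 4.0, "max_value": 9.0})
lower_only = range_violations(table, {"column": "pH", "min_value": 4.0, "max_value": None})
```

The resulting mask can then be used either to drop rows or to count violations, depending on the configured action.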
3. Missing Data Checks
Filter based on data completeness:
- Max Missing Percentage: Reject samples with too many NaN values
- Required Columns: Ensure critical columns exist and have data
- Gap Detection: Identify time series with large temporal gaps
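Gap detection is not tied to a documented parameter, so the following is only a sketch of one way to measure it: take the largest difference between consecutive time stamps. The function name `largest_time_gap` and the threshold `max_gap_hours` are assumptions introduced for illustration.

```python
import pandas as pd

def largest_time_gap(table: pd.DataFrame, time_column: str = "Temps de culture (h)") -> float:
    """Return the largest interval between consecutive time stamps (0.0 if fewer than two)."""
    times = table[time_column].dropna().sort_values()
    if len(times) < 2:
        return 0.0
    return float(times.diff().max())

table = pd.DataFrame({"Temps de culture (h)": [0.0, 1.0, 2.0, 10.0, 11.0]})
gap = largest_time_gap(table)

max_gap_hours = 6.0                  # hypothetical threshold, not a task parameter
has_large_gap = gap > max_gap_hours
```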
4. Custom Filters
Apply flexible pandas-style filters:
- Expression-Based: e.g., "column > value"
- Aggregate Functions: mean, std, min, max over entire timeseries
- Cross-Column: Validate relationships (e.g., substrate + product balance)
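The three filter styles can be illustrated with plain pandas. How the task actually parses filter expressions is not specified here, so this sketch simply uses `DataFrame.query` and ordinary Series operations; the column names and thresholds are hypothetical. Simple column names are used on purpose: in `query`, names containing spaces or punctuation must be wrapped in backticks.

```python
import pandas as pd

table = pd.DataFrame({
    "time_h": [0, 1, 2, 3],
    "biomass": [0.1, 0.4, 0.9, 1.6],
    "glucose": [20.0, 18.5, 15.0, 10.2],
})

# Expression-based: keep only rows matching "column > value"
kept = table.query("biomass > 0.3")

# Aggregate function: one decision for the whole time series
passes_mean = bool(table["biomass"].mean() >= 0.5)

# Cross-column: substrate plus product should never exceed the initial
# substrate by more than a tolerance (hypothetical mass-balance rule)
balance_ok = bool(((table["glucose"] + table["biomass"]) <= 20.5).all())
```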
Configuration Parameters
Outlier Detection Parameters
outlier_method (String)
- Default: "none"
- Options: "none", "zscore", "iqr", "percentile"
- Description: Statistical method for outlier detection
- Recommendation: Use "iqr" for biological data (robust to extremes)
outlier_threshold (Float)
- Default: 3.0
- Range: 1.0 to 10.0
- For Z-Score: Number of standard deviations (3.0 covers about 99.7% of a normal distribution)
- For IQR: Multiplier for the interquartile range (1.5 = Tukey's rule; set it explicitly, since the parameter default is 3.0)
- Impact: Lower = stricter (more outliers detected)
outlier_percentile_low (Float)
- Default: 1.0
- Range: 0.0 to 50.0
- Only For: percentile method
- Description: Lower percentile cutoff (values below are outliers)
outlier_percentile_high (Float)
- Default: 99.0
- Range: 50.0 to 100.0
- Only For: percentile method
- Description: Upper percentile cutoff (values above are outliers)
outlier_columns (List[String])
- Default: [] (all numeric columns)
- Description: Columns to check for outliers (empty = all)
- Example: ["Biomasse (g/L)", "DO2 (%)"]
outlier_action (String)
- Default: "remove_rows"
- Options: "remove_rows", "mark_only", "remove_sample"
- Description: Action when outliers detected
  - "remove_rows": Delete rows with outliers
  - "mark_only": Add quality_warning tag but keep data
  - "remove_sample": Exclude entire sample from output
Range Validation Parameters
range_checks (List[ParamSet])
Define acceptable value ranges for columns:
    [
        {
            'column': 'Temperature (°C)',
            'min_value': 15.0,
            'max_value': 45.0,
            'action': 'remove_rows'
        },
        {
            'column': 'pH',
            'min_value': 4.0,
            'max_value': 9.0,
            'action': 'mark_only'
        }
    ]
- column: Column name to validate
- min_value: Minimum acceptable value (None = no minimum)
- max_value: Maximum acceptable value (None = no maximum)
- action: "remove_rows", "mark_only", or "remove_sample"
Missing Data Parameters
max_missing_percentage (Float)
- Default: 50.0
- Range: 0.0 to 100.0
- Description: Maximum percentage of NaN values allowed per sample
- Action: Samples exceeding this are excluded from output
required_columns (List[String])
- Default: []
- Description: Columns that must exist with ≥ 1 non-NaN value
- Example: ["Temps de culture (h)", "Biomasse (g/L)"]
- Action: Samples missing these columns are excluded
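The two completeness checks can be sketched as follows. Whether the task computes the missing percentage over all cells or per column is not stated here, so the all-cells version below is an assumption, as are the helper names `missing_percentage` and `missing_required`.

```python
import pandas as pd

def missing_percentage(table: pd.DataFrame) -> float:
    """Percentage of NaN cells over the whole table (assumed definition)."""
    if table.size == 0:
        return 100.0
    return float(table.isna().sum().sum()) / table.size * 100.0

def missing_required(table: pd.DataFrame, required_columns: list) -> list:
    """Return required columns that are absent or contain no non-NaN value."""
    return [
        name for name in required_columns
        if name not in table.columns or table[name].notna().sum() == 0
    ]

table = pd.DataFrame({
    "Temps de culture (h)": [0, 1, 2, 3],
    "Biomasse (g/L)": [0.1, float("nan"), float("nan"), 0.8],
})
pct = missing_percentage(table)
absent = missing_required(table, ["Temps de culture (h)", "Glucose (g/L)"])

# A sample would be excluded if either check fails
excluded = pct > 50.0 or bool(absent)
```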
Data Point Count Parameters
min_data_points (List[ParamSet])
Define minimum number of non-NaN measurements required for specific columns:
    [
        {
            'column': 'Biomasse (g/L)',
            'min_count': 3,
            'action': 'remove_sample'
        },
        {
            'column': 'DO2 (%)',
            'min_count': 5,
            'action': 'mark_only'
        }
    ]
- column: Column name to check
- min_count: Minimum number of non-NaN values required
- action: "mark_only" or "remove_sample"
- Use Case: Ensure sufficient data points for analysis (e.g., growth curve fitting needs ≥3 points)
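A minimal sketch of how rules in this format could be evaluated is shown below. The function name `check_min_data_points`, the wording of the warning message, and the choice to treat an absent column as having zero points are assumptions.

```python
import pandas as pd

def check_min_data_points(table: pd.DataFrame, rules: list):
    """Return (exclude_sample, warnings) for a list of min_data_points rules."""
    exclude_sample = False
    warnings = []
    for rule in rules:
        column = rule["column"]
        # An absent column is treated as having zero data points (assumption)
        count = int(table[column].notna().sum()) if column in table.columns else 0
        if count < rule["min_count"]:
            if rule["action"] == "remove_sample":
                exclude_sample = True
            else:  # "mark_only"
                warnings.append(f"{column}: {count} points, {rule['min_count']} required")
    return exclude_sample, warnings

table = pd.DataFrame({
    "Biomasse (g/L)": [0.1, float("nan"), 0.5, float("nan")],
    "DO2 (%)": [40.0, 41.0, 42.0, 43.0],
})
rules = [
    {"column": "Biomasse (g/L)", "min_count": 3, "action": "remove_sample"},
    {"column": "DO2 (%)", "min_count": 5, "action": "mark_only"},
]
exclude_sample, warnings = check_min_data_points(table, rules)
```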
Additional Parameters
add_quality_tags (Boolean)
- Default: True
- Description: Add quality check result tags to output Tables
- Tags Added:
  - quality_check_passed: "true" or "false"
  - quality_warnings: Description of issues found
  - outliers_detected: Count of outliers found
  - missing_data_percentage: Percentage of missing values
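For illustration, the tags listed above could be assembled as a plain dictionary of strings. The exact value formats used by the task (number formatting, warning separator) are not specified in this section, so the formats below are assumptions, as is the helper name `build_quality_tags`.

```python
def build_quality_tags(passed: bool, warnings: list, outliers: int, missing_pct: float) -> dict:
    """Assemble quality tags as string key/value pairs (assumed formats)."""
    return {
        "quality_check_passed": "true" if passed else "false",
        "quality_warnings": ", ".join(warnings),          # comma-separated list
        "outliers_detected": str(outliers),
        "missing_data_percentage": f"{missing_pct:.1f}",  # one decimal place (assumption)
    }

tags = build_quality_tags(
    passed=True,
    warnings=["pH out of range", "2 outliers in DO2 (%)"],
    outliers=2,
    missing_pct=12.5,
)
```

Downstream tasks can then select samples by tag value, for example by keeping only those where `quality_check_passed` equals `"true"`.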
Input Requirements
data (ResourceSet)
- Source: FermentalgLoadData or Filter task output (raw data)
- Requirements:
  - Must contain Table resources
  - Tables must have numeric data columns
- Recommended: Temps de culture (h) column for time series
- Tags Used: batch, sample, medium (preserved in output)
- Purpose: Quality checks are performed on this ResourceSet
interpolated_data (ResourceSet)
- Source: Interpolation task output (interpolated time series data)
- Requirements:
  - Must contain Table resources matching data ResourceSet
  - Resource names should match those in data ResourceSet
- Tags Used: batch, sample, medium (preserved in output)
- Purpose: Filtered based on quality checks from data ResourceSet
Output Structure
filtered_data (ResourceSet)
Contains raw data Tables that passed all quality checks:
Data
- Rows: May be reduced if outliers/invalid values removed
- Columns: Same as input
- Values: Outliers and out-of-range values removed (if configured)
Tags (Preserved + New)
- Original: batch, sample, medium, missing_value
- Quality Tags (if add_quality_tags=True):
  - quality_check_passed: Overall pass/fail
  - quality_warnings: Comma-separated list of warnings
  - outliers_detected: Number of outlier points found
  - missing_data_percentage: % of NaN values
  - range_violations: Number of range violations
filtered_interpolated_data (ResourceSet)
Contains interpolated data Tables for samples that passed quality checks:
Data
- Rows: Same as input interpolated data (not modified)
- Columns: Same as input
- Values: Unchanged (filtering is sample-level, not value-level)
Tags (Preserved + New)
- Original: batch, sample, medium
- Quality Tags (if add_quality_tags=True):
  - quality_check_passed: "true" (only passing samples included)
Excluded Samples
- Samples failing checks are logged but not included in output
- Check log for list of excluded samples and reasons
Processing Logic
Execution Flow
- Input Validation: Check both ResourceSets contain Tables
- Selection Filtering:
  a. Extract all (batch, sample) pairs from the interpolated_data ResourceSet
  b. Filter the data ResourceSet to only process samples matching these pairs
  c. This ensures only selected samples are quality-checked
- Quality Check on Raw Data (per-sample):
  a. Extract the Table and its data from the filtered data ResourceSet
  b. Check missing data percentage
  c. Validate required columns exist
  d. Check minimum data points per column (e.g., Biomasse needs ≥3 points)
  e. Apply range checks (per column)
  f. Detect outliers (per column)
  g. Take action (remove rows, mark, or exclude sample)
  h. Calculate quality metrics
  i. Add quality tags if enabled
  j. Track which samples passed
- Filter Interpolated Data:
  a. For each sample in the interpolated_data ResourceSet
  b. Include only if the corresponding sample in data passed checks
  c. Copy all tags and add quality_check_passed=true
- Output Assembly: Create filtered versions of both ResourceSets
- Summary Logging: Report statistics on checks performed
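The selection and synchronized-filtering steps of this flow can be sketched with plain dictionaries standing in for ResourceSets. The real ResourceSet and Table APIs are not shown in this document, so the data layout (`{"tags": ..., "rows": ...}`), the function names, and the `passes` callback are all hypothetical.

```python
def sample_key(tags: dict) -> tuple:
    """Identify a sample by its (batch, sample) tag pair."""
    return (tags["batch"], tags["sample"])

def synchronized_filter(raw: dict, interpolated: dict, passes) -> tuple:
    """Quality-check raw samples, then keep matching interpolated samples."""
    # Selection filtering: only samples present in the interpolated set are checked
    selected = {sample_key(r["tags"]) for r in interpolated.values()}
    to_check = {n: r for n, r in raw.items() if sample_key(r["tags"]) in selected}
    # Quality check on raw data, tracking which samples passed
    passed = {sample_key(r["tags"]) for r in to_check.values() if passes(r)}
    filtered_raw = {n: r for n, r in to_check.items() if sample_key(r["tags"]) in passed}
    # Interpolated data is filtered at sample level and tagged, never row-edited
    filtered_interp = {
        n: {**r, "tags": {**r["tags"], "quality_check_passed": "true"}}
        for n, r in interpolated.items()
        if sample_key(r["tags"]) in passed
    }
    return filtered_raw, filtered_interp

raw = {
    "A": {"tags": {"batch": "B1", "sample": "S1"}, "rows": 10},
    "B": {"tags": {"batch": "B1", "sample": "S2"}, "rows": 2},
    "C": {"tags": {"batch": "B2", "sample": "S1"}, "rows": 12},  # not selected
}
interpolated = {
    "A_interp": {"tags": {"batch": "B1", "sample": "S1"}, "rows": 100},
    "B_interp": {"tags": {"batch": "B1", "sample": "S2"}, "rows": 100},
}
# Hypothetical quality rule: a sample needs at least 3 raw rows
filtered_raw, filtered_interp = synchronized_filter(raw, interpolated, lambda r: r["rows"] >= 3)
```

In this example, sample C is never checked because it has no interpolated counterpart, and sample B is dropped from both outputs because its raw data fails the rule.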
Decision Rules
- Sample Pre-Filtered If:
  - (batch, sample) pair not present in the interpolated_data ResourceSet
  - This ensures only selected samples are quality-checked
- Sample Excluded If (from both outputs):
  - Missing data > max_missing_percentage (in raw data)
  - Missing any required_columns (in raw data)
  - Column has fewer data points than min_count (when action="remove_sample")
  - Any check with action="remove_sample" triggered (in raw data)
- Rows Removed From Raw Data If:
  - Contains outlier (when action="remove_rows")
  - Value outside range (when action="remove_rows")
- Interpolated Data:
  - Sample-level filtering only (no row removal)
  - Kept unchanged if corresponding raw data sample passed
- Sample Marked If:
  - Issues detected but action="mark_only"
  - Warnings added to quality_warnings tag
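The three actions behind these rules can be sketched as a single dispatch on a boolean row mask. This is only an illustration under assumptions: the function name `apply_action`, the use of `None` to signal an excluded sample, and the warning text are not taken from the task's code.

```python
import pandas as pd

def apply_action(table: pd.DataFrame, flagged: pd.Series, action: str):
    """Apply one action to flagged rows; return (table_or_None, warnings)."""
    warnings = []
    if not flagged.any():
        return table, warnings                 # nothing to do
    if action == "remove_rows":
        return table[~flagged], warnings       # drop only the offending rows
    if action == "mark_only":
        warnings.append(f"{int(flagged.sum())} flagged rows")
        return table, warnings                 # data kept, issue recorded
    if action == "remove_sample":
        return None, warnings                  # None signals sample exclusion
    raise ValueError(f"Unknown action: {action}")

table = pd.DataFrame({"DO2 (%)": [40.0, 41.0, 95.0]})
flagged = pd.Series([False, False, True])

rows_removed, _ = apply_action(table, flagged, "remove_rows")
marked, marked_warnings = apply_action(table, flagged, "mark_only")
excluded, _ = apply_action(table, flagged, "remove_sample")
```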
Use Cases
1. Pre-Processing Pipeline
FermentalgLoadData
↓
Filter (select samples)
↓
Interpolation
↓
QualityCheck (remove outliers, validate ranges on raw data, filter interpolated)
↓
Analysis (use filtered_interpolated_data)
QualityCheck (
    outlier_method="iqr",
    outlier_threshold=1.5,
    outlier_action="remove_rows"
)
→ Clean dataset for figures

QualityCheck (
    outlier_action="mark_only",
    add_quality_tags=True
)
→ Review quality_warnings tags
→ Decide which samples to exclude manually

QualityCheck (
    range_checks=[
        {column: "Temperature", min_value: 20, max_value: 40},
        {column: "pH", min_value: 5, max_value: 8},
        {column: "DO2", min_value: 0, max_value: 100}
    ]
)
→ Ensure sensor data is valid
```python
{
    'outlier_method': 'iqr',
    'outlier_threshold': 1.5,
    'outlier_action': 'remove_rows',
    'add_quality_tags': True
}
```
Strict Quality Gate
```python
{
    'outlier_method': 'zscore',
    'outlier_threshold': 2.5,
    'max_missing_percentage': 10.0,
    'required_columns': [
        'Temps de culture (h)',
        'Biomasse (g/L)',
        'Glucose (g/L)'
    ],
    'range_checks': [
        {'column': 'pH', 'min_value': 5.0, 'max_value': 8.5, 'action': 'remove_sample'}
    ]
}
```
Quality Audit (Mark Only)
```python
{
    'outlier_method': 'percentile',
    'outlier_percentile_low': 2.0,
    'outlier_percentile_high': 98.0,
    'outlier_action': 'mark_only',
    'add_quality_tags': True
}
```
Minimum Data Points for Analysis
```python
{
    'min_data_points': [
        {'column': 'Biomasse (g/L)', 'min_count': 3, 'action': 'remove_sample'},
        {'column': 'Glucose (g/L)', 'min_count': 3, 'action': 'remove_sample'},
        {'column': 'DO2 (%)', 'min_count': 5, 'action': 'mark_only'}
    ],
    'add_quality_tags': True
}
```
Use Case: Ensure sufficient measurements for growth curve fitting (needs ≥3 points)
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| All samples removed | Thresholds too strict | Relax thresholds, use mark_only first |
| No outliers detected | Threshold too high | Lower outlier_threshold |
| Wrong columns checked | Column names mismatch | Verify exact column names (case-sensitive) |
| Too many rows removed | Action on wrong columns | Use outlier_columns to specify target columns |
Best Practices
- Start Lenient: Use mark_only first to see quality distribution
- Review Distributions: Check data before setting range thresholds
- Use IQR for Biology: Biological data often isn't normal
- Log Review: Always check log for excluded samples and reasons
- Preserve Originals: Keep unfiltered data for comparison
- Document Settings: Record quality check parameters with results
Scientific Considerations
Outlier Detection Sensitivity
- Z-Score: Assumes normal distribution (rare in biology)
- IQR: Robust, works with skewed data (recommended)
- Percentile: Good for known distributions
Common Fermentation Ranges
- Temperature: 15-45°C (organism dependent)
- pH: 4-9 (process dependent)
- DO2: 0-100% saturation
- Biomass: 0-50 g/L (typical range)
- Substrates/Products: 0-200 g/L (depends on strain)
When to Remove vs Mark
- Remove: Clear sensor errors, initialization artifacts
- Mark: Biological variation, borderline outliers
- Remove Sample: Systemic failure, contamination
Notes
- All quality checks are optional (can disable all for pass-through)
- Original ResourceSet is never modified
- Quality tags enable downstream filtering/analysis
- Compatible with all Fermentalg workflow tasks
- Processing is per-sample (samples don't affect each other)