
Converting Snakemake Workflows to Constellab
This guide explains how to migrate your Snakemake pipelines to the Constellab platform using AI-assisted conversion.
Simple Conversion Powered by AI
Converting Snakemake pipelines to Constellab is remarkably simple thanks to AI automation. The AI assistant analyzes your Snakemake workflow, understands the rule logic, and automatically generates equivalent Constellab tasks with minimal user intervention. You don't need to manually rewrite code or understand the intricacies of both systems—the AI handles the heavy lifting for you.
Key benefits:
- 🤖 Fully automated analysis of your Snakemake workflow structure
- ⚡ Instant task generation with proper inputs, outputs, and configurations
- 🎯 Smart script conversion from bash to Python for better maintainability
- 📝 Automatic documentation generation for all converted tasks
- ✅ Built-in validation to ensure conversion accuracy
Simply point the AI to your Snakefile, answer a few validation questions, and let it create production-ready Constellab tasks in seconds.
What is Snakemake?
Snakemake is a workflow management system that enables reproducible and scalable data analyses. It uses a Python-based language to define computational pipelines composed of rules connected via input/output file dependencies.
Key features:
- Rule-based workflows: Define independent computational steps (rules) that are connected via file dependencies
- Reproducibility: Environment management through Conda/Mamba integration ensures consistent results
- Scalability: Execute workflows in parallel across local machines, HPC clusters, or cloud platforms
- Python-based DSL: Write workflows using a familiar Python-based domain-specific language
- Automatic parallelization: Snakemake automatically determines which rules can run in parallel based on dependencies
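For illustration, a minimal Snakefile with two rules chained through a file dependency might look like the sketch below (file names are hypothetical); Snakemake infers the execution order from the matching output and input paths:

```
rule count_reads:
    input: "data/sample.fastq"
    output: "results/sample.count"
    shell: "wc -l {input} > {output}"

rule summarize:
    input: "results/sample.count"  # produced by count_reads
    output: "results/summary.txt"
    shell: "cp {input} {output}"
```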
Snakemake has become a standard tool in bioinformatics and scientific computing, particularly for genomics and data analysis workflows.
Snakemake vs. Constellab: A Comparison
Both Snakemake and Constellab are open-source workflow management systems, but they serve different needs.
Key Advantages of Constellab
Constellab's main strength is being a complete platform, not just a workflow engine:
- Unified Interface: Manage pipelines, data, experiments, and results in one place
- Full Traceability: Every execution, parameter change, and data transformation is tracked and auditable
- Data-Centric: First-class data management with visualization, annotation, and sharing capabilities
- Collaboration: Teams can share protocols, datasets, and results seamlessly
- Extensibility: Modular "brick" architecture allows easy addition of new capabilities
- No-Code Options: Build and run workflows through the GUI without writing code
- Hybrid Approach: Supports both GUI-based and code-based workflow development
What is Supported in the Conversion?
The AI-powered conversion supports a wide range of Snakemake workflow features:
Workflow Structures
- ✅ Standard Rules: Single-level rule definitions
- ✅ Complex Workflows: Multi-step pipelines with file dependencies
- ✅ Wildcard Patterns: Handling of sample patterns and batch processing
Rule Execution Types
1. Shell Command Rules ✅
- Shell/Bash scripts embedded in `shell:` blocks
- Option to convert to Python: the AI can translate bash commands to Python code for:
  - Better maintainability and code readability
  - Easier debugging with proper error messages
  - Cross-platform compatibility
  - Integration with Python data science libraries
Example: `fastqc {input} -o {output}` → Python with `subprocess` or BioPython
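As a minimal sketch of what such a translation could look like (the `run_fastqc` helper and its paths are illustrative, not the exact code the AI generates):

```python
import subprocess

def run_fastqc(input_path: str, output_dir: str, threads: int = 2) -> None:
    # Equivalent of the Snakemake shell command: fastqc {input} -o {output_dir} -t {threads}
    result = subprocess.run(
        ["fastqc", input_path, "-o", output_dir, "-t", str(threads)],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface a proper error message instead of a silent shell failure
        raise RuntimeError(f"FastQC failed: {result.stderr}")
```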
2. Python Script Rules ✅
- Rules that execute external Python scripts (`script:` directive)
- Rules with inline Python code blocks (`run:` directive)
- Automatic integration of Python logic into Constellab task `run()` methods
- Preservation of Python dependencies and imports
3. Virtual Environment Rules ✅
- Rules using Conda environments (`conda:` directive)
- Rules using Bioconda packages (common in bioinformatics)
- Rules with Mamba for faster environment solving
- Converted to use `MambaShellProxy`, `CondaShellProxy`, or `PipShellProxy` in Constellab
- Automatic handling of environment YAML files
Example: `conda: "envs/qc.yaml"` → `MambaShellProxy` with env file
4. Docker Container Rules 🧪 (Beta)
- Rules using Docker containers (`container:` directive)
- Rules with Singularity images
- Converted to use Constellab's `DockerService` for container orchestration
- Automatic generation of `docker-compose.yml` files
- Note: Beta feature; complex container configurations may require manual adjustments
Example: `container: "docker://biocontainers/fastqc:0.11.9"` → `DockerService` integration
5. Jupyter Notebook Rules ✅
- Rules that execute Jupyter notebooks (`notebook:` directive)
- Options to convert notebook cells to Python code or execute via nbconvert/papermill
- Preservation of notebook outputs and visualizations
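As a hedged sketch, executing a converted notebook rule via papermill could look like the following (notebook paths and parameter names are illustrative):

```python
import papermill as pm

# Run the notebook with injected parameters; papermill saves an executed
# copy that preserves all cell outputs and visualizations.
pm.execute_notebook(
    "notebooks/analysis.ipynb",           # source notebook (illustrative path)
    "results/analysis_executed.ipynb",    # executed copy with outputs
    parameters={"sample_name": "sample1", "threshold": 0.05},
)
```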
6. Wrapper Rules ✅
- Rules using Snakemake wrappers (`wrapper:` directive)
- AI analyzes wrapper functionality and implements equivalent logic
- Support for common bioinformatics wrappers
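For reference, a typical wrapper rule looks like the snippet below (the wrapper version is illustrative); during conversion, the AI replaces the wrapper reference with equivalent explicit logic in the generated task:

```
rule fastqc:
    input: "data/raw/sample1.fastq"
    output:
        html="results/qc/sample1.html",
        zip="results/qc/sample1_fastqc.zip"
    wrapper:
        "v3.0.0/bio/fastqc"
```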
Current Limitations
Protocol Template Generation 🚧 (Coming Soon)
The automatic generation of Constellab Protocol templates (equivalent to Snakemake workflows) is not yet available but is coming soon. Currently:
- ✅ Individual tasks are generated and immediately usable
- ✅ Tasks can be manually assembled into protocols using the GUI
- ✅ Task connections are documented in conversion output
- 🚧 Automatic protocol template creation is under development
Once available, the AI will generate complete protocol templates that:
- Automatically connect tasks based on Snakemake rule dependencies
- Set default parameters from Snakemake params and config
- Preserve workflow execution order and file dependencies
- Generate reusable templates ready for immediate use
Workaround: After task conversion, manually create protocol templates using the visual editor (see "Build Protocol Templates" section below).
Converting Snakemake to Constellab
Overview
Converting a Snakemake workflow to Constellab involves transforming each Snakemake rule into a Constellab Task. The AI-powered conversion command automates most of this process while preserving the original logic.
Conversion Concept
Each Snakemake rule becomes a Constellab Task: rule inputs and outputs map to task input/output resources, params values become configuration parameters, and the file dependencies between rules become connections between tasks.
Using the AI Conversion Command
The conversion is performed using the AI command /gws-snakemake-to-constellab.
Step 1: Prepare Your Snakemake Workflow
Ensure your Snakemake workflow is accessible in your workspace:
# Example structure
/lab/user/
├── snakemake/
│ ├── Snakefile # Your Snakemake workflow
│ ├── config.yaml # Configuration file (optional)
│ ├── envs/ # Conda environment files
│ │ ├── qc.yaml
│ │ └── analysis.yaml
│ └── scripts/ # External scripts (optional)
│       └── analyze.py
Step 2: Invoke the Conversion Command
In the Constellab AI assistant, use:
/gws-snakemake-to-constellab /lab/user/snakemake/Snakefile
Step 3: Review the Analysis
The AI will analyze your workflow and present:
- Configuration Parameters: All `config` and `params` values with defaults
- Wildcards: Pattern variables used across rules (e.g., `{sample}`, `{replicate}`)
- Rule Inventory: Each rule with its:
  - Inputs and outputs
  - Execution directive type (shell, script, run, etc.)
  - Virtual environment requirements
  - Brief description of functionality
- Workflow Structure: How rules are connected via file dependencies
- Validation Questions:
  - Which rules to convert?
  - Convert bash scripts to Python or keep as shell commands?
  - Handle external scripts inline or keep external?
  - How to handle wildcards (config params vs. batch processing)?
  - Confirm task names
  - Target brick location
Example Analysis Output:
## Snakemake Workflow Analysis
### Configuration
- Config values: min_length = 10, quality_threshold = 20
- Wildcards: {sample} - values inferred from input files
### Rules to Convert
#### 1. quality_control
- **Input**: FASTQ file (`data/raw/{sample}.fastq`)
- **Output**: QC file (`results/qc/{sample}_qc.txt`), Stats file (`results/qc/{sample}_stats.txt`)
- **Params**: quality_threshold = 20
- **Threads**: 2
- **Execution**: `shell:` directive with FastQC command
- **Environment**: conda environment (`envs/qc.yaml`)
- **Description**: Performs quality control on raw sequencing data using FastQC
#### 2. trim_reads
- **Input**: FASTQ file (`data/raw/{sample}.fastq`), QC file (dependency)
- **Output**: Trimmed FASTQ (`data/trimmed/{sample}_trimmed.fastq`)
- **Params**: min_length = 10
- **Execution**: `shell:` directive with cutadapt command
- **Description**: Trims low-quality reads based on minimum length threshold
#### 3. analyze
- **Input**: Trimmed FASTQ file
- **Output**: Analysis results (`results/analysis/{sample}_analysis.txt`)
- **Params**: threshold = 0.05
- **Threads**: 2
- **Execution**: `script:` directive → `scripts/analyze.py`
- **Environment**: conda environment (`envs/analysis.yaml`)
- **Description**: Analyzes trimmed data using external Python script
#### 4. generate_report
- **Input**: QC stats, Analysis results
- **Output**: Report (`results/{sample}_report.txt`)
- **Execution**: `run:` directive with inline Python code
- **Description**: Generates comprehensive sample report combining QC and analysis
### Workflow Structure
quality_control → trim_reads → analyze → generate_report
### Wildcards
- {sample}: Processes multiple samples (inferred: sample1, sample2, sample3)
- **Conversion Strategy Options**:
- Option A: Create config parameter for sample name (user specifies sample)
- Option B: Process multiple samples in a single task (batch processing)
---
**Questions for Validation:**
1. Should I convert all 4 rules as tasks? (Yes/No or specify which ones)
2. For `shell:` directives, should I convert bash scripts to Python code? (Recommended: Yes for better maintainability)
3. For `script:` directives, should I inline the Python script or keep it external? (Recommended: Inline for single-file tasks)
4. How should I handle wildcards?
- Option A: Config parameter (user specifies sample name at runtime)
- Option B: Batch processing (process all samples in one task execution)
5. Confirm task names: `QualityControl`, `TrimReads`, `Analyze`, `GenerateReport`?
6. Where should I create these tasks? (brick name or create new brick, e.g., "gws_genomics")
Please confirm before I proceed with implementation.
Step 4: Confirm Conversion
Answer the validation questions:
Yes, convert all 4 rules.
Convert bash to Python where possible.
Inline the Python script for the analyze rule.
Use config parameter for sample name (Option A).
Task names are good.
Create in a new brick called "gws_genomics".
Step 5: Automatic Task Generation
The AI will:
- Generate Constellab Task classes for each rule
- Convert Snakemake execution logic to Python
- Map inputs/outputs to Constellab resources (File, Folder, etc.)
- Create configuration parameters from Snakemake params and wildcards
- Handle virtual environments (conda, container)
- Add comprehensive documentation
- Generate unit tests
Each generated task will be a standalone Python file in your specified brick:
bricks/
└── gws_genomics/
└── src/
└── gws_genomics/
├── quality_control.py
├── trim_reads.py
├── analyze.py
├── generate_report.py
└── ...
After Conversion: Using Your Tasks
1. Tasks are Available in the System
Once converted, your tasks are immediately available in Constellab:
- Navigate to the Task Library in the web interface
- Search for your newly created tasks (e.g., "QualityControl", "TrimReads")
- View task documentation, inputs, outputs, and parameters
2. Create Scenarios
You can now run individual tasks as scenarios:
- Go to Scenarios → Create New Scenario
- Select a task from the library
- Configure inputs and parameters
- Execute and monitor the scenario
- View results and execution traces
3. Build Protocol Templates (Recommended)
The Protocol Template is Constellab's equivalent to a Snakemake workflow. It allows you to:
- Chain tasks together in a logical sequence
- Define data flow between tasks
- Set default configurations for reproducible workflows
- Share and reuse complete pipelines
Creating a Protocol Template
Option A: Visual Protocol Editor (GUI)
- Go to Protocols → Create New Protocol
- Drag and drop tasks from the library onto the canvas
- Connect task outputs to inputs by drawing links
- Configure default parameters for each task
- Add annotations and documentation
- Save as a template
Example Protocol Structure for the Genomics Workflow:
[QualityControl] → [TrimReads] → [Analyze] → [GenerateReport]
↓ ↓ ↓ ↓
qc_file trimmed_file analysis_file report
stats_file
Benefits of Protocol Templates
- Reusability: Apply the same workflow to different datasets
- Consistency: Ensure standardized processing across experiments
- Collaboration: Share templates with team members
- Versioning: Track changes to workflow structure over time
- Parameterization: Create flexible templates with adjustable parameters
4. Execute Complete Workflows
Once your protocol template is created:
- Create a Scenario from Template:
  - Select your protocol template
  - Provide input data
  - Adjust parameters if needed
  - Run the entire pipeline
- Monitor Execution:
  - Real-time progress tracking
  - View logs for each task
  - Inspect intermediate results
- Access Results:
  - Download output files
  - Visualize data with built-in views
  - Export results to external systems
- Trace Execution:
  - Full audit trail of what ran, when, and with what parameters
  - Reproduce results by re-running with identical settings
  - Compare different runs side-by-side
Detailed Conversion Examples
Example 1: Shell Command Rule
Original Snakemake Rule:
rule quality_control:
input:
"data/raw/{sample}.fastq"
output:
qc="results/qc/{sample}_qc.txt",
stats="results/qc/{sample}_stats.txt"
params:
quality_threshold=20
threads: 2
conda: "envs/qc.yaml"
shell:
"""
fastqc {input} -o results/qc -t {threads}
echo "Quality: {params.quality_threshold}" > {output.stats}
"""Converted Constellab Task:
from gws_core import Task, task_decorator, ConfigParams, TaskInputs, TaskOutputs
from gws_core import InputSpec, OutputSpec, InputSpecs, OutputSpecs, ConfigSpecs
from gws_core import File, StrParam, IntParam, MambaShellProxy
import os
@task_decorator(
unique_name="QualityControl",
human_name="Quality Control",
short_description="Perform quality control on sequencing data"
)
class QualityControl(Task):
"""
[Generated by Snakemake to Task Converter]
Converted from Snakemake rule: quality_control
## Description
Performs quality control analysis on raw FASTQ files using FastQC.
Generates QC reports and quality statistics.
"""
input_specs = InputSpecs({
'fastq_file': InputSpec(
File,
human_name="FASTQ file",
short_description="Raw sequencing data file"
)
})
output_specs = OutputSpecs({
'qc_file': OutputSpec(
File,
human_name="QC file",
short_description="Quality control results"
),
'stats_file': OutputSpec(
File,
human_name="Stats file",
short_description="Quality statistics"
)
})
config_specs = ConfigSpecs({
'sample_name': StrParam(
human_name="Sample name",
short_description="Name of the sample (from wildcard {sample})",
default_value="sample1"
),
'quality_threshold': IntParam(
human_name="Quality threshold",
short_description="Minimum quality score threshold",
default_value=20
),
'threads': IntParam(
human_name="Threads",
short_description="Number of CPU threads to use",
default_value=2,
min_value=1
)
})
def run(self, params: ConfigParams, inputs: TaskInputs) -> TaskOutputs:
# Get inputs
fastq_file = inputs['fastq_file']
# Get parameters
sample_name = params.get_value('sample_name')
quality_threshold = params.get_value('quality_threshold')
threads = params.get_value('threads')
# Create output paths
tmp_dir = self.create_tmp_dir()
qc_path = os.path.join(tmp_dir, f"{sample_name}_qc.txt")
stats_path = os.path.join(tmp_dir, f"{sample_name}_stats.txt")
# Get conda environment file path
env_file_path = os.path.join(os.path.dirname(__file__), "qc_env.yaml")
# Execute with conda environment
mamba = MambaShellProxy(
env_file_path=env_file_path,
env_name="qc_env",
message_dispatcher=self.message_dispatcher
)
self.log_info_message(f"Running quality control for {sample_name} with {threads} threads...")
# Run FastQC
mamba.run(f"fastqc {fastq_file.path} -o {tmp_dir} -t {threads}")
# Generate stats file
with open(stats_path, 'w') as f:
f.write(f"Quality: {quality_threshold}\n")
self.log_success_message("Quality control completed successfully")
# Return outputs
return {
'qc_file': File(qc_path),
'stats_file': File(stats_path)
}
Example 2: Python Script Rule
Original Snakemake Rule:
rule analyze:
input:
"data/trimmed/{sample}_trimmed.fastq"
output:
"results/analysis/{sample}_analysis.txt"
params:
threshold=0.05
threads: 2
conda: "envs/analysis.yaml"
script:
"scripts/analyze.py"Converted Constellab Task (with inlined script):
from gws_core import Task, task_decorator, ConfigParams, TaskInputs, TaskOutputs
from gws_core import InputSpec, OutputSpec, InputSpecs, OutputSpecs, ConfigSpecs
from gws_core import File, StrParam, FloatParam, IntParam
import os
@task_decorator(
unique_name="Analyze",
human_name="Analyze",
short_description="Analyze trimmed sequencing data"
)
class Analyze(Task):
"""
[Generated by Snakemake to Task Converter]
Converted from Snakemake rule: analyze
## Description
Analyzes trimmed sequencing data using custom analysis logic.
The original external script has been inlined into this task.
"""
input_specs = InputSpecs({
'trimmed_file': InputSpec(
File,
human_name="Trimmed FASTQ",
short_description="Trimmed sequencing data"
)
})
output_specs = OutputSpecs({
'analysis_file': OutputSpec(
File,
human_name="Analysis results",
short_description="Analysis output file"
)
})
config_specs = ConfigSpecs({
'sample_name': StrParam(
human_name="Sample name",
short_description="Name of the sample",
default_value="sample1"
),
'threshold': FloatParam(
human_name="Threshold",
short_description="Analysis threshold value",
default_value=0.05
),
'threads': IntParam(
human_name="Threads",
short_description="Number of CPU threads",
default_value=2,
min_value=1
)
})
def run(self, params: ConfigParams, inputs: TaskInputs) -> TaskOutputs:
# Get inputs
trimmed_file = inputs['trimmed_file']
# Get parameters
sample_name = params.get_value('sample_name')
threshold = params.get_value('threshold')
threads = params.get_value('threads')
# Create output path
tmp_dir = self.create_tmp_dir()
output_path = os.path.join(tmp_dir, f"{sample_name}_analysis.txt")
# Inlined script logic from scripts/analyze.py
self.log_info_message(f"Analyzing {sample_name} with threshold {threshold}...")
with open(trimmed_file.path, 'r') as f:
data = f.read()
# Perform analysis (original script logic)
result = f"Analysis Results for {sample_name}\n"
result += f"{'=' * 40}\n"
result += f"Threshold: {threshold}\n"
result += f"Threads used: {threads}\n"
result += f"\nInput data preview:\n{data[:200]}\n"
# Write results
with open(output_path, 'w') as f:
f.write(result)
self.log_success_message("Analysis completed")
return {'analysis_file': File(output_path)}
Example 3: Python Run Block Rule
Original Snakemake Rule:
rule generate_report:
input:
qc="results/qc/{sample}_stats.txt",
analysis="results/analysis/{sample}_analysis.txt"
output:
"results/{sample}_report.txt"
run:
with open(output[0], 'w') as f_out:
f_out.write(f"Report for {wildcards.sample}\n")
f_out.write("=" * 40 + "\n\n")
with open(input.qc, 'r') as f:
f_out.write("QC Statistics:\n")
f_out.write(f.read() + "\n\n")
with open(input.analysis, 'r') as f:
f_out.write("Analysis Results:\n")
f_out.write(f.read())
Converted Constellab Task:
from gws_core import Task, task_decorator, ConfigParams, TaskInputs, TaskOutputs
from gws_core import InputSpec, OutputSpec, InputSpecs, OutputSpecs, ConfigSpecs
from gws_core import File, StrParam
import os
@task_decorator(
unique_name="GenerateReport",
human_name="Generate Report",
short_description="Generate analysis report from QC and analysis results"
)
class GenerateReport(Task):
"""
[Generated by Snakemake to Task Converter]
Converted from Snakemake rule: generate_report
## Description
Combines QC statistics and analysis results into a comprehensive report.
"""
input_specs = InputSpecs({
'qc_stats': InputSpec(
File,
human_name="QC statistics",
short_description="Quality control statistics file"
),
'analysis_results': InputSpec(
File,
human_name="Analysis results",
short_description="Analysis output file"
)
})
output_specs = OutputSpecs({
'report': OutputSpec(
File,
human_name="Report",
short_description="Combined analysis report"
)
})
config_specs = ConfigSpecs({
'sample_name': StrParam(
human_name="Sample name",
short_description="Name of the sample",
default_value="sample1"
)
})
def run(self, params: ConfigParams, inputs: TaskInputs) -> TaskOutputs:
# Get inputs
qc_stats = inputs['qc_stats']
analysis_results = inputs['analysis_results']
# Get parameters
sample_name = params.get_value('sample_name')
# Create output path
tmp_dir = self.create_tmp_dir()
output_path = os.path.join(tmp_dir, f"{sample_name}_report.txt")
# Logic from run: block (directly converted)
with open(output_path, 'w') as f_out:
f_out.write(f"Report for {sample_name}\n")
f_out.write("=" * 40 + "\n\n")
with open(qc_stats.path, 'r') as f:
f_out.write("QC Statistics:\n")
f_out.write(f.read() + "\n\n")
with open(analysis_results.path, 'r') as f:
f_out.write("Analysis Results:\n")
f_out.write(f.read())
self.log_success_message(f"Report generated for {sample_name}")
return {'report': File(output_path)}
Handling Special Cases
Wildcards and Batch Processing
Snakemake uses wildcards to process multiple samples. Constellab offers two strategies:
Strategy A: Config Parameter (Single Sample)
- Each task execution processes one sample
- User specifies sample name as a config parameter
- Multiple samples require multiple task executions or protocol runs
Strategy B: Batch Processing (Multiple Samples)
- Task processes all samples in a single execution
- Use `ListParam` for sample names
- Loop through samples within the task
Example (Batch Processing):
from gws_core import ListParam
config_specs = ConfigSpecs({
'sample_names': ListParam(
human_name="Sample names",
short_description="List of sample names to process",
default_value=["sample1", "sample2", "sample3"]
)
})
def run(self, params: ConfigParams, inputs: TaskInputs) -> TaskOutputs:
    sample_names = params.get_value('sample_names')
    outputs = {}
    for sample in sample_names:
        self.log_info_message(f"Processing {sample}...")
        # Process each sample and add its result to `outputs`
        # ...
    return outputs
Virtual Environments
Snakemake's `conda:` directive is converted to `MambaShellProxy`:
Original:
rule align:
conda: "envs/alignment.yaml"
shell: "bwa mem ..."Converted:
env_file_path = os.path.join(os.path.dirname(__file__), "alignment_env.yaml")
mamba = MambaShellProxy(
env_file_path=env_file_path,
env_name="alignment_env",
message_dispatcher=self.message_dispatcher
)
mamba.run("bwa mem ...")Docker Containers (Beta)
Snakemake's `container:` directive is converted to Constellab's `DockerService`. This is a beta feature; complex container configurations may require manual adjustments after conversion.
Summary
Converting from Snakemake to Constellab offers:
- ✅ Automated conversion with AI assistance
- ✅ Preserved logic from original Snakemake rules
- ✅ Enhanced traceability and audit trails
- ✅ User-friendly interface for non-programmers
- ✅ Complete platform for data and pipeline management
- ✅ Flexible deployment in any environment
- ✅ Reusable protocol templates for standardized workflows
- ✅ Better maintainability through Python-based tasks
Start your conversion today with /gws-snakemake-to-constellab and experience the power of a complete data lab automation platform!