downloaders

`lacuna.io.downloaders` ¶

Downloaders subpackage for source-specific download logic.

This module provides downloader implementations for various data sources:

- Harvard Dataverse (for GSP1000)
- Figshare (for dTOR985)
- GitHub Releases (for HCP1065)

Each downloader handles authentication, rate limiting, and error handling specific to its data source.

`ConnectomeSource` `dataclass` ¶

Configuration for a fetchable connectome source.

Source code in src/lacuna/io/downloaders/base.py

@dataclass
class ConnectomeSource:
    """Configuration for a fetchable connectome source."""

    name: str
    """Unique identifier (e.g., 'gsp1000', 'dtor985')."""

    display_name: str
    """Human-readable name (e.g., 'GSP1000 Functional Connectome')."""

    type: Literal["functional", "structural"]
    """Connectome type determining processing pipeline."""

    description: str
    """User-facing description of the connectome."""

    source_type: Literal["dataverse", "figshare", "github"]
    """Download source requiring specific authentication/handling."""

    # Dataverse-specific
    persistent_id: str | None = None
    """DOI for Dataverse datasets (e.g., 'doi:10.7910/DVN/ILXIKS')."""

    dataverse_server: str = "https://dataverse.harvard.edu"
    """Dataverse server URL."""

    # Figshare-specific
    download_url: str | None = None
    """Direct download URL for Figshare files (deprecated, use article_id)."""

    article_id: int | None = None
    """Figshare article ID for API-based downloads."""

    # Processing
    default_batches: int = 10
    """Default number of HDF5 batches (functional only)."""

    requires_mask: bool = False
    """Whether brain mask is needed for processing."""

    mask_url: str | None = None
    """URL to download brain mask if required."""

    # Metadata
    n_subjects: int = 0
    """Number of subjects in the connectome."""

    space: str = "MNI152NLin6Asym"
    """Coordinate space."""

    estimated_size_gb: float = 0.0
    """Estimated download size in GB for user information."""

    citation: str = ""
    """Citation text for this connectome dataset."""

`article_id = None` `class-attribute` `instance-attribute` ¶

Figshare article ID for API-based downloads.

`citation = ''` `class-attribute` `instance-attribute` ¶

Citation text for this connectome dataset.

`dataverse_server = 'https://dataverse.harvard.edu'` `class-attribute` `instance-attribute` ¶

Dataverse server URL.

`default_batches = 10` `class-attribute` `instance-attribute` ¶

Default number of HDF5 batches (functional only).

`description` `instance-attribute` ¶

User-facing description of the connectome.

`display_name` `instance-attribute` ¶

Human-readable name (e.g., 'GSP1000 Functional Connectome').

`download_url = None` `class-attribute` `instance-attribute` ¶

Direct download URL for Figshare files (deprecated, use article_id).

`estimated_size_gb = 0.0` `class-attribute` `instance-attribute` ¶

Estimated download size in GB for user information.

`mask_url = None` `class-attribute` `instance-attribute` ¶

URL to download brain mask if required.

`n_subjects = 0` `class-attribute` `instance-attribute` ¶

Number of subjects in the connectome.

`name` `instance-attribute` ¶

Unique identifier (e.g., 'gsp1000', 'dtor985').

`persistent_id = None` `class-attribute` `instance-attribute` ¶

DOI for Dataverse datasets (e.g., 'doi:10.7910/DVN/ILXIKS').

`requires_mask = False` `class-attribute` `instance-attribute` ¶

Whether brain mask is needed for processing.

`source_type` `instance-attribute` ¶

Download source requiring specific authentication/handling.

`space = 'MNI152NLin6Asym'` `class-attribute` `instance-attribute` ¶

Coordinate space.

`type` `instance-attribute` ¶

Connectome type determining processing pipeline.

`DataverseDownloader` ¶

Bases: BaseDownloader

Downloader for Harvard Dataverse datasets.

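Handles authentication via API key and verifies checksums. Files already present in the output directory are skipped when their checksum matches, so an interrupted run can be restarted without re-downloading completed files.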

Parameters:

Name	Type	Description	Default
`source`	`ConnectomeSource`	Configuration for the connectome source.	required
`api_key`	`str | None`	Dataverse API key. If not provided, the key is looked up from the `DATAVERSE_API_KEY` environment variable or a config file.	`None`

Raises:

Type	Description
`AuthenticationError`	If no API key is available.
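
Before downloading anything, the class asks the server for the dataset's file listing and reduces each entry to the few fields it needs (see `_get_dataset_files` in the source below). The following is a standalone sketch of that reduction. The function name `flatten_file_listing` and the sample payload are made up for illustration; only the key names are taken from the source.

```python
def flatten_file_listing(payload: dict) -> list[dict]:
    """Reduce a file-listing payload to id, filename, size, and checksum."""
    files = []
    for entry in payload.get("data", []):
        data_file = entry.get("dataFile", {})
        checksum = data_file.get("checksum", {})
        files.append(
            {
                "id": data_file.get("id"),
                # Fall back to a synthetic name when the server omits one.
                "filename": data_file.get("filename", f"file_{data_file.get('id')}"),
                "size": data_file.get("filesize", 0),
                "checksum": checksum.get("value"),
                "checksum_type": checksum.get("type", "MD5").lower(),
            }
        )
    return files


# Made-up payload in the shape the source code expects; values are not real.
sample = {
    "status": "OK",
    "data": [
        {
            "dataFile": {
                "id": 1,
                "filename": "info.json",
                "filesize": 120,
                "checksum": {"type": "MD5", "value": "abc123"},
            }
        },
        {"dataFile": {"id": 2}},  # missing fields fall back to defaults
    ],
}

for item in flatten_file_listing(sample):
    print(item["filename"], item["size"], item["checksum_type"])
# info.json 120 md5
# file_2 0 md5
```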

Source code in src/lacuna/io/downloaders/dataverse.py

class DataverseDownloader(BaseDownloader):
    """
    Downloader for Harvard Dataverse datasets.

    Handles authentication via API key and supports resumable downloads
    with checksum verification.

    Parameters
    ----------
    source : ConnectomeSource
        Configuration for the connectome source.
    api_key : str, optional
        Dataverse API key. If not provided, will attempt to get from
        environment variable or config file.

    Raises
    ------
    AuthenticationError
        If no API key is available.
    """

    def __init__(
        self,
        source: ConnectomeSource,
        api_key: str | None = None,
    ):
        super().__init__(source)
        self.api_key = get_api_key(api_key)
        if self.api_key is None:
            raise AuthenticationError(
                source=source.name,
                reason=(
                    "Dataverse API key required. Set DATAVERSE_API_KEY environment "
                    "variable or use --api-key argument."
                ),
            )
        self.session = requests.Session()
        self.session.headers.update({"X-Dataverse-key": self.api_key})

    def download(
        self,
        output_path: Path,
        progress_callback: Callable[[FetchProgress], None] | None = None,
        test_mode: bool = False,
        skip_checksum: bool = False,
    ) -> list[Path]:
        """
        Download dataset files from Dataverse.

        Parameters
        ----------
        output_path : Path
            Directory to download files to.
        progress_callback : callable, optional
            Function called with FetchProgress updates.
        test_mode : bool, default=False
            If True, downloads only 1 tarball for testing the full pipeline.
            Metadata files (JSON, TXT, masks) are always downloaded.
        skip_checksum : bool, default=False
            If True, skip checksum verification. Use when server metadata
            is outdated and causes false checksum mismatches.

        Returns
        -------
        list[Path]
            List of downloaded file paths.

        Raises
        ------
        DownloadError
            If download fails.
        AuthenticationError
            If authentication fails.
        """
        output_path = Path(output_path)
        output_path.mkdir(parents=True, exist_ok=True)

        # Get dataset metadata
        files_info = self._get_dataset_files()

        # In test mode: download all metadata files + only 1 tarball
        if test_mode:
            metadata_files = []
            tar_files = []
            for f in files_info:
                filename = f.get("filename", "")
                # Metadata files: JSON, TXT, NIfTI masks - always download
                if filename.endswith((".json", ".txt", ".nii.gz", ".nii")):
                    metadata_files.append(f)
                # Tarballs: limit to 1 in test mode
                elif filename.endswith(".tar"):
                    tar_files.append(f)

            # Take only first tarball in test mode
            files_info = metadata_files + tar_files[:1]

        downloaded_files: list[Path] = []

        for i, file_info in enumerate(files_info):
            file_id = file_info["id"]
            filename = file_info["filename"]
            checksum = file_info.get("checksum")
            file_size = file_info.get("size", 0)

            output_file = output_path / filename

            # Report progress
            if progress_callback:
                progress_callback(
                    FetchProgress(
                        phase="download",
                        current_file=filename,
                        files_completed=i,
                        files_total=len(files_info),
                        bytes_total=file_size,
                        message=f"Downloading {filename}",
                    )
                )

            # Skip if already downloaded and checksum matches (unless skipping)
            if output_file.exists():
                if skip_checksum or (checksum and self._verify_checksum(output_file, checksum)):
                    downloaded_files.append(output_file)
                    continue

            # Download file
            self._download_file(
                file_id=file_id,
                output_file=output_file,
                expected_checksum=None if skip_checksum else checksum,
                progress_callback=progress_callback,
                file_index=i,
                total_files=len(files_info),
            )
            downloaded_files.append(output_file)

        return downloaded_files

    def _get_dataset_files(self) -> list[dict]:
        """
        Get list of files in the dataset.

        Returns
        -------
        list[dict]
            List of file metadata dicts with id, filename, checksum, size.

        Raises
        ------
        DownloadError
            If API request fails.
        """
        if not self.source.persistent_id:
            raise DownloadError(
                url=self.source.dataverse_server,
                reason="No persistent_id configured for source",
            )

        # Use the dataset files API
        url = (
            f"{self.source.dataverse_server}/api/datasets/"
            f":persistentId/versions/:latest/files"
            f"?persistentId={self.source.persistent_id}"
        )

        try:
            response = self.session.get(url, timeout=30)
            response.raise_for_status()
        except requests.exceptions.HTTPError as e:
            if response.status_code == 401:
                raise AuthenticationError(
                    source=self.source.name,
                    reason="Invalid API key",
                ) from e
            raise DownloadError(
                url=url,
                reason=f"HTTP {response.status_code}: {response.text}",
            ) from e
        except requests.exceptions.RequestException as e:
            raise DownloadError(url=url, reason=str(e)) from e

        data = response.json()
        if data.get("status") != "OK":
            raise DownloadError(
                url=url,
                reason=f"API error: {data.get('message', 'Unknown error')}",
            )

        files_info = []
        for file_data in data.get("data", []):
            df = file_data.get("dataFile", {})
            checksum_info = df.get("checksum", {})
            files_info.append(
                {
                    "id": df.get("id"),
                    "filename": df.get("filename", f"file_{df.get('id')}"),
                    "size": df.get("filesize", 0),
                    "checksum": checksum_info.get("value"),
                    "checksum_type": checksum_info.get("type", "MD5").lower(),
                }
            )

        return files_info

    def _download_file(
        self,
        file_id: int,
        output_file: Path,
        expected_checksum: str | None = None,
        progress_callback: Callable[[FetchProgress], None] | None = None,
        file_index: int = 0,
        total_files: int = 1,
    ) -> None:
        """
        Download a single file by ID.

        Parameters
        ----------
        file_id : int
            Dataverse file ID.
        output_file : Path
            Output file path.
        expected_checksum : str, optional
            Expected MD5 checksum for verification.
        progress_callback : callable, optional
            Progress callback function.
        file_index : int
            Current file index for progress.
        total_files : int
            Total number of files for progress.

        Raises
        ------
        DownloadError
            If download fails.
        ChecksumError
            If checksum verification fails.
        AuthenticationError
            If the server responds with HTTP 401.
        """
        url = f"{self.source.dataverse_server}/api/access/datafile/{file_id}"

        try:
            response = self.session.get(url, stream=True, timeout=60)
            response.raise_for_status()
        except requests.exceptions.HTTPError as e:
            if response.status_code == 401:
                raise AuthenticationError(
                    source=self.source.name,
                    reason="Invalid API key",
                ) from e
            raise DownloadError(
                url=url,
                reason=f"HTTP {response.status_code}",
            ) from e
        except requests.exceptions.RequestException as e:
            raise DownloadError(url=url, reason=str(e)) from e

        total_size = int(response.headers.get("content-length", 0))
        hasher = hashlib.md5()

        # Use temp file for atomic write
        temp_file = output_file.with_suffix(output_file.suffix + ".tmp")

        try:
            with open(temp_file, "wb") as f:
                with tqdm(
                    total=total_size,
                    unit="B",
                    unit_scale=True,
                    desc=output_file.name,
                    disable=progress_callback is not None,  # Disable if using callback
                ) as pbar:
                    bytes_downloaded = 0
                    for chunk in response.iter_content(chunk_size=1024 * 1024):
                        if chunk:
                            f.write(chunk)
                            hasher.update(chunk)
                            bytes_downloaded += len(chunk)
                            pbar.update(len(chunk))

                            if progress_callback:
                                progress_callback(
                                    FetchProgress(
                                        phase="download",
                                        current_file=output_file.name,
                                        files_completed=file_index,
                                        files_total=total_files,
                                        bytes_transferred=bytes_downloaded,
                                        bytes_total=total_size,
                                        message=f"Downloading {output_file.name}",
                                    )
                                )

            # Verify checksum
            if expected_checksum:
                actual_checksum = hasher.hexdigest()
                if actual_checksum.lower() != expected_checksum.lower():
                    temp_file.unlink()
                    raise ChecksumError(
                        filepath=str(output_file),
                        expected=expected_checksum,
                        actual=actual_checksum,
                    )

            # Move to final location (replace() also overwrites an existing
            # file on Windows, where rename() would raise)
            temp_file.replace(output_file)

        except Exception:
            if temp_file.exists():
                temp_file.unlink()
            raise

    def _verify_checksum(self, filepath: Path, expected: str) -> bool:
        """Verify file MD5 checksum."""
        hasher = hashlib.md5()
        with open(filepath, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                hasher.update(chunk)
        return hasher.hexdigest().lower() == expected.lower()
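
The core of `_download_file` is a write-to-temp, hash-while-streaming, then move-into-place pattern. The sketch below isolates that pattern without any network access. `write_verified` is a hypothetical name, a plain `ValueError` stands in for the package's `ChecksumError`, and `Path.replace` is used so that an existing target is overwritten on every platform.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path
from typing import Iterable


def write_verified(
    chunks: Iterable[bytes],
    output_file: Path,
    expected_md5: str | None = None,
) -> None:
    """Stream chunks to a temp file, verify the MD5, then move into place."""
    temp_file = output_file.with_suffix(output_file.suffix + ".tmp")
    hasher = hashlib.md5()
    try:
        with open(temp_file, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
                hasher.update(chunk)  # hash while writing: no second read pass
        actual = hasher.hexdigest()
        if expected_md5 and actual.lower() != expected_md5.lower():
            raise ValueError(f"checksum mismatch: expected {expected_md5}, got {actual}")
        # Only a fully written, verified file ever appears under the final name.
        temp_file.replace(output_file)
    except Exception:
        temp_file.unlink(missing_ok=True)  # never leave a partial file behind
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sample.tar"
    good = hashlib.md5(b"hello world").hexdigest()
    write_verified([b"hello ", b"world"], target, expected_md5=good)
    print(target.read_bytes())  # b'hello world'

    try:
        write_verified([b"corrupted"], target, expected_md5=good)
    except ValueError as exc:
        print("rejected:", exc)
    print(target.read_bytes())  # still b'hello world'
```

Because the final name is only assigned after verification, a failed or interrupted transfer cannot be mistaken for a complete file on the next run.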

`download(output_path, progress_callback=None, test_mode=False, skip_checksum=False)` ¶

Download dataset files from Dataverse.

Parameters:

Name	Type	Description	Default
`output_path`	`Path`	Directory to download files to.	required
`progress_callback`	`callable`	Function called with FetchProgress updates.	`None`
`test_mode`	`bool`	If True, downloads only 1 tarball for testing the full pipeline. Metadata files (JSON, TXT, masks) are always downloaded.	`False`
`skip_checksum`	`bool`	If True, skip checksum verification. Use when server metadata is outdated and causes false checksum mismatches.	`False`

Returns:

Type	Description
`list[Path]`	List of downloaded file paths.

Raises:

Type	Description
`DownloadError`	If download fails.
`AuthenticationError`	If authentication fails.
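
The `test_mode` filter can be read as a small pure function over the file listing. The sketch below mirrors the suffix rules in the source; the helper name `select_test_files` and the filenames are invented for illustration.

```python
METADATA_SUFFIXES = (".json", ".txt", ".nii.gz", ".nii")


def select_test_files(files_info: list[dict]) -> list[dict]:
    """Keep every metadata file plus the first .tar file, in that order."""
    metadata = [
        f for f in files_info if f.get("filename", "").endswith(METADATA_SUFFIXES)
    ]
    tarballs = [f for f in files_info if f.get("filename", "").endswith(".tar")]
    return metadata + tarballs[:1]


listing = [
    {"filename": "batch_01.tar"},
    {"filename": "subjects.txt"},
    {"filename": "batch_02.tar"},
    {"filename": "mask.nii.gz"},
    {"filename": "README.md"},
]

print([f["filename"] for f in select_test_files(listing)])
# ['subjects.txt', 'mask.nii.gz', 'batch_01.tar']
```

Two consequences follow from these rules: metadata files are moved ahead of the tarball regardless of their position in the listing, and files with any other suffix (such as `README.md` above, or a `.tar.gz` archive) are dropped in test mode.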

`FetchConfig` `dataclass` ¶

Configuration for a connectome fetch operation.

Source code in src/lacuna/io/downloaders/base.py

@dataclass
class FetchConfig:
    """Configuration for a connectome fetch operation."""

    connectome: str
    """Connectome name to fetch (e.g., 'gsp1000', 'dtor985')."""

    output_dir: Path
    """Directory for processed output files."""

    # Authentication
    api_key: str | None = None
    """Dataverse API key (for GSP1000). Can also use DATAVERSE_API_KEY env var."""

    # Processing options
    batches: int = 10
    """Number of HDF5 batch files for functional connectomes."""

    keep_original: bool = True
    """Keep original downloaded files after processing."""

    # Registration
    register: bool = True
    """Automatically register connectome after processing."""

    register_name: str | None = None
    """Custom name for registration. Defaults to source name (e.g., 'GSP1000')."""

    # Behavior
    force: bool = False
    """Overwrite existing files and registrations."""

    resume: bool = True
    """Resume interrupted downloads."""

    @classmethod
    def from_cli_args(cls, args: argparse.Namespace) -> FetchConfig:
        """Create config from CLI arguments."""
        return cls(
            connectome=getattr(args, "connectome", ""),
            output_dir=Path(getattr(args, "output_dir", ".")),
            api_key=getattr(args, "api_key", None),
            batches=getattr(args, "batches", 10),
            keep_original=not getattr(args, "no_keep_original", False),
            register=not getattr(args, "no_register", False),
            register_name=getattr(args, "register_name", None),
            force=getattr(args, "force", False),
            resume=getattr(args, "resume", True),
        )

    def get_api_key(self) -> str | None:
        """Get API key from config, env var, or config file."""
        if self.api_key:
            return self.api_key
        if key := os.environ.get("DATAVERSE_API_KEY"):
            return key
        # Check config file
        return _load_config_file_key()

`api_key = None` `class-attribute` `instance-attribute` ¶

Dataverse API key (for GSP1000). Can also use DATAVERSE_API_KEY env var.

`batches = 10` `class-attribute` `instance-attribute` ¶

Number of HDF5 batch files for functional connectomes.

`connectome` `instance-attribute` ¶

Connectome name to fetch (e.g., 'gsp1000', 'dtor985').

`force = False` `class-attribute` `instance-attribute` ¶

Overwrite existing files and registrations.

`keep_original = True` `class-attribute` `instance-attribute` ¶

Keep original downloaded files after processing.

`output_dir` `instance-attribute` ¶

Directory for processed output files.

`register = True` `class-attribute` `instance-attribute` ¶

Automatically register connectome after processing.

`register_name = None` `class-attribute` `instance-attribute` ¶

Custom name for registration. Defaults to source name (e.g., 'GSP1000').

`resume = True` `class-attribute` `instance-attribute` ¶

Resume interrupted downloads.

`from_cli_args(args)` `classmethod` ¶

Create config from CLI arguments.
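
`from_cli_args` reads every attribute with `getattr` and a default, and it inverts the negated flags (`no_keep_original`, `no_register`) into positive booleans. The sketch below reproduces that mapping with a trimmed stand-in dataclass. The parser and its flag spellings are assumptions inferred from the attribute names; they are not confirmed by the source.

```python
from __future__ import annotations

import argparse
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ConfigStub:
    """Trimmed stand-in for FetchConfig (illustration only)."""

    connectome: str
    output_dir: Path
    keep_original: bool = True
    register: bool = True
    force: bool = False


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; the flag names are assumptions."""
    parser = argparse.ArgumentParser(prog="fetch")
    parser.add_argument("connectome")
    parser.add_argument("--output-dir", default=".")
    parser.add_argument("--no-keep-original", action="store_true")
    parser.add_argument("--no-register", action="store_true")
    parser.add_argument("--force", action="store_true")
    return parser


def config_from_args(args: argparse.Namespace) -> ConfigStub:
    """Map a namespace to the config, tolerating missing attributes."""
    return ConfigStub(
        connectome=getattr(args, "connectome", ""),
        output_dir=Path(getattr(args, "output_dir", ".")),
        keep_original=not getattr(args, "no_keep_original", False),
        register=not getattr(args, "no_register", False),
        force=getattr(args, "force", False),
    )


args = build_parser().parse_args(["gsp1000", "--no-register"])
config = config_from_args(args)
print(config.register, config.keep_original, config.force)  # False True False

# An empty namespace still yields a usable config because of the defaults.
print(config_from_args(argparse.Namespace()))
```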

`get_api_key()` ¶

Get API key from config, env var, or config file.
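
The lookup order is: the explicit `api_key` value, then the `DATAVERSE_API_KEY` environment variable, then the config file. The sketch below makes that precedence testable by passing the environment and the file loader in as arguments. `resolve_api_key` and its parameters are hypothetical, and the config file loader is replaced by a plain callable because its behaviour is not shown in the source.

```python
from __future__ import annotations

from typing import Callable, Mapping


def resolve_api_key(
    explicit: str | None,
    environ: Mapping[str, str],
    load_from_file: Callable[[], str | None],
) -> str | None:
    """Return the first key found: explicit value, env var, then config file."""
    if explicit:
        return explicit
    if key := environ.get("DATAVERSE_API_KEY"):
        return key
    return load_from_file()


def no_file_key() -> str | None:
    """Stand-in for a config file that holds no key."""
    return None


env = {"DATAVERSE_API_KEY": "env-key"}
print(resolve_api_key("cli-key", env, no_file_key))    # cli-key
print(resolve_api_key(None, env, no_file_key))         # env-key
print(resolve_api_key(None, {}, lambda: "file-key"))   # file-key
print(resolve_api_key("", {}, no_file_key))            # None
```

Note that an empty string is treated the same as a missing key at every level, since both checks rely on truthiness.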

`FetchProgress` `dataclass` ¶

Progress information for fetch operations.

Source code in src/lacuna/io/downloaders/base.py

@dataclass
class FetchProgress:
    """Progress information for fetch operations."""

    phase: Literal["download", "processing", "registration"]
    """Current operation phase."""

    current_file: str
    """Name of file currently being processed."""

    files_completed: int
    """Number of files completed."""

    files_total: int
    """Total number of files to process."""

    bytes_transferred: int = 0
    """Bytes transferred in current download."""

    bytes_total: int = 0
    """Total bytes for current download."""

    message: str = ""
    """Human-readable status message."""

    @property
    def percent_complete(self) -> float:
        """Overall percentage completion."""
        if self.files_total == 0:
            return 0.0
        return (self.files_completed / self.files_total) * 100

    @property
    def download_percent(self) -> float:
        """Current file download percentage."""
        if self.bytes_total == 0:
            return 0.0
        return (self.bytes_transferred / self.bytes_total) * 100

`bytes_total = 0` `class-attribute` `instance-attribute` ¶

Total bytes for current download.

`bytes_transferred = 0` `class-attribute` `instance-attribute` ¶

Bytes transferred in current download.

`current_file` `instance-attribute` ¶

Name of file currently being processed.

`download_percent` `property` ¶

Current file download percentage.

`files_completed` `instance-attribute` ¶

Number of files completed.

`files_total` `instance-attribute` ¶

Total number of files to process.

`message = ''` `class-attribute` `instance-attribute` ¶

Human-readable status message.

`percent_complete` `property` ¶

Overall percentage completion.

`phase` `instance-attribute` ¶

Current operation phase.

`FetchResult` `dataclass` ¶

Result of a connectome fetch operation.

Source code in src/lacuna/io/downloaders/base.py

@dataclass
class FetchResult:
    """Result of a connectome fetch operation."""

    success: bool
    """Whether the operation completed successfully."""

    connectome_name: str
    """Name of the fetched connectome."""

    output_dir: Path
    """Directory containing processed files."""

    output_files: list[Path] = field(default_factory=list)
    """List of created output files."""

    registered: bool = False
    """Whether the connectome was registered."""

    register_name: str | None = None
    """Name used for registration, or None if not registered."""

    duration_seconds: float = 0.0
    """Total operation time in seconds."""

    download_time_seconds: float = 0.0
    """Time spent downloading."""

    processing_time_seconds: float = 0.0
    """Time spent processing."""

    warnings: list[str] = field(default_factory=list)
    """Non-fatal warnings encountered."""

    error: str | None = None
    """Error message if success=False."""

    def summary(self) -> str:
        """Generate human-readable summary."""
        if self.success:
            return (
                f"✅ Successfully fetched {self.connectome_name}\n"
                f"   Output: {self.output_dir}\n"
                f"   Files: {len(self.output_files)}\n"
                f"   Registered as: {self.register_name or 'not registered'}\n"
                f"   Time: {self.download_time_seconds:.1f}s download, "
                f"{self.processing_time_seconds:.1f}s processing"
            )
        return f"❌ Failed to fetch {self.connectome_name}: {self.error}"

`connectome_name` `instance-attribute` ¶

Name of the fetched connectome.

`download_time_seconds = 0.0` `class-attribute` `instance-attribute` ¶

Time spent downloading.

`duration_seconds = 0.0` `class-attribute` `instance-attribute` ¶

Total operation time in seconds.

`error = None` `class-attribute` `instance-attribute` ¶

Error message if success=False.

`output_dir` `instance-attribute` ¶

Directory containing processed files.

`output_files = field(default_factory=list)` `class-attribute` `instance-attribute` ¶

List of created output files.

`processing_time_seconds = 0.0` `class-attribute` `instance-attribute` ¶

Time spent processing.

`register_name = None` `class-attribute` `instance-attribute` ¶

Name used for registration, or None if not registered.

`registered = False` `class-attribute` `instance-attribute` ¶

Whether the connectome was registered.

`success` `instance-attribute` ¶

Whether the operation completed successfully.

`warnings = field(default_factory=list)` `class-attribute` `instance-attribute` ¶

Non-fatal warnings encountered.

`summary()` ¶

Generate human-readable summary.

`FigshareDownloader` ¶

Bases: BaseDownloader

Downloader for Figshare files using the authenticated API.

Uses the Figshare API with an authentication token to obtain download URLs that bypass AWS WAF protection.

Parameters:

Name	Type	Description	Default
`source`	`ConnectomeSource`	Configuration for the connectome source.	required
`api_key`	`str`	Figshare API key. If not provided, uses FIGSHARE_API_KEY env var.	`None`
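
In the portion of the source shown below, an existing file is reused when its size equals the size reported by the API and that size is non-zero; no checksum is compared at that step. A minimal sketch of that check, with a hypothetical helper name and made-up filenames:

```python
import tempfile
from pathlib import Path


def already_downloaded(output_file: Path, expected_size: int) -> bool:
    """True when the file exists and its size equals a known, non-zero size."""
    if expected_size <= 0 or not output_file.exists():
        return False
    return output_file.stat().st_size == expected_size


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "connectome.zip"
    target.write_bytes(b"x" * 10)

    print(already_downloaded(target, 10))  # True: sizes match
    print(already_downloaded(target, 12))  # False: partial or stale file
    print(already_downloaded(target, 0))   # False: size unknown, download again
    print(already_downloaded(Path(tmp) / "missing.zip", 10))  # False
```

Treating an unknown size (zero) as "not downloaded" errs on the side of fetching the file again rather than trusting a file that cannot be validated.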

Source code in src/lacuna/io/downloaders/figshare.py

class FigshareDownloader(BaseDownloader):
    """
    Downloader for Figshare files using authenticated API.

    Uses Figshare API with authentication token to get download URLs
    that bypass AWS WAF protection.

    Parameters
    ----------
    source : ConnectomeSource
        Configuration for the connectome source.
    api_key : str, optional
        Figshare API key. If not provided, uses FIGSHARE_API_KEY env var.
    """

    # Figshare API base URL
    API_BASE = "https://api.figshare.com/v2"

    def __init__(
        self,
        source: ConnectomeSource,
        api_key: str | None = None,
    ):
        super().__init__(source)

        # Get API key from param or environment
        self.api_key = api_key or os.environ.get(FIGSHARE_API_KEY_ENV)

    def download(
        self,
        output_path: Path,
        progress_callback: Callable[[FetchProgress], None] | None = None,
    ) -> list[Path]:
        """
        Download file from Figshare using authenticated API.

        Parameters
        ----------
        output_path : Path
            Directory to download files to.
        progress_callback : callable, optional
            Function called with FetchProgress updates.

        Returns
        -------
        list[Path]
            List of downloaded file paths (single file for Figshare).

        Raises
        ------
        DownloadError
            If the download fails, or if the API key or article_id is missing.
        """
        # Check for API key
        if not self.api_key:
            raise DownloadError(
                url="",
                reason=(
                    f"Figshare API key required. Set via:\n"
                    f"  - Environment variable: {FIGSHARE_API_KEY_ENV}\n"
                    f"  - Command line: --api-key YOUR_KEY\n\n"
                    f"Get a free API key from:\n"
                    f"  https://figshare.com/account/applications\n"
                    f"  (Create 'Personal token' under 'Applications')"
                ),
            )

        # Check for article_id
        if not self.source.article_id:
            raise DownloadError(
                url="",
                reason="No article_id configured for Figshare source",
            )

        output_path = Path(output_path)
        output_path.mkdir(parents=True, exist_ok=True)

        # Get file info from API
        file_info = self._get_file_info()
        filename = file_info["name"]
        download_url = file_info["download_url"]
        total_size = file_info.get("size", 0)

        output_file = output_path / filename

        # Skip if already exists
        if output_file.exists():
            existing_size = output_file.stat().st_size
            if existing_size == total_size and total_size > 0:
                if progress_callback:
                    progress_callback(
                        FetchProgress(
                            phase="download",
                            current_file=filename,
                            files_completed=1,
                            files_total=1,
                            message=f"Already downloaded: {filename}",
                        )
                    )
                return [output_file]

        # Report progress
        if progress_callback:
            progress_callback(
                FetchProgress(
                    phase="download",
                    current_file=filename,
                    files_completed=0,
                    files_total=1,
                    message=f"Downloading {filename}",
                )
            )

        # Download file using authenticated URL
        self._download_file(
            url=download_url,
            output_file=output_file,
            total_size=total_size,
            progress_callback=progress_callback,
        )

        return [output_file]

    def _get_headers(self) -> dict[str, str]:
        """Get request headers with API authentication."""
        return {"Authorization": f"token {self.api_key}"}

    def _get_file_info(self) -> dict:
        """
        Get file information from Figshare API.

        Returns
        -------
        dict
            File info including name, download_url, and size.

        Raises
        ------
        DownloadError
            If API request fails.
        """
        api_url = f"{self.API_BASE}/articles/{self.source.article_id}/files"

        try:
            response = requests.get(
                api_url,
                headers=self._get_headers(),
                timeout=30,
            )
            response.raise_for_status()
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 401:
                raise DownloadError(
                    url=api_url,
                    reason=(
                        "Figshare API authentication failed. "
                        "Check that your API key is valid.\n\n"
                        "Get a new API key from:\n"
                        "  https://figshare.com/account/applications"
                    ),
                ) from e
            raise DownloadError(
                url=api_url,
                reason=f"Figshare API error: {e}",
            ) from e
        except Exception as e:
            raise DownloadError(
                url=api_url,
                reason=f"Failed to connect to Figshare API: {e}",
            ) from e

        files = response.json()
        if not files:
            raise DownloadError(
                url=api_url,
                reason="No files found in Figshare article",
            )

        # Get the first (and usually only) file
        return files[0]

    def _get_filename_from_url(self, url: str) -> str | None:
        """Extract filename from URL path (kept for compatibility)."""
        from urllib.parse import unquote, urlparse

        parsed = urlparse(url)
        path = unquote(parsed.path)
        if "/" in path:
            filename = path.split("/")[-1]
            if "." in filename:
                return filename
        return None

    def _download_file(
        self,
        url: str,
        output_file: Path,
        total_size: int = 0,
        progress_callback: Callable[[FetchProgress], None] | None = None,
    ) -> None:
        """
        Download a single file from Figshare.

        Parameters
        ----------
        url : str
            Authenticated download URL.
        output_file : Path
            Output file path.
        total_size : int
            Expected file size in bytes.
        progress_callback : callable, optional
            Progress callback function.

        Raises
        ------
        DownloadError
            If download fails.
        """
        try:
            response = requests.get(
                url,
                headers=self._get_headers(),
                stream=True,
                timeout=60,
            )
            response.raise_for_status()
        except requests.exceptions.HTTPError as e:
            raise DownloadError(
                url=url,
                reason=f"Download failed: HTTP {e.response.status_code}",
            ) from e
        except Exception as e:
            raise DownloadError(url=url, reason=str(e)) from e

        # Get size from headers if not provided
        if total_size == 0:
            total_size = int(response.headers.get("content-length", 0))

        # Check for HTML response (should not happen with API auth, but be safe)
        content_type = response.headers.get("content-type", "")
        if "text/html" in content_type.lower():
            raise DownloadError(
                url=url,
                reason=(
                    "Received HTML instead of file data. "
                    "The API token may have insufficient permissions."
                ),
            )

        # Use temp file for atomic write
        temp_file = output_file.with_suffix(output_file.suffix + ".tmp")

        try:
            with open(temp_file, "wb") as f:
                with tqdm(
                    total=total_size,
                    unit="B",
                    unit_scale=True,
                    desc=output_file.name,
                    disable=progress_callback is not None,
                ) as pbar:
                    bytes_downloaded = 0
                    for chunk in response.iter_content(chunk_size=1024 * 1024):
                        if chunk:
                            f.write(chunk)
                            bytes_downloaded += len(chunk)
                            pbar.update(len(chunk))

                            if progress_callback:
                                progress_callback(
                                    FetchProgress(
                                        phase="download",
                                        current_file=output_file.name,
                                        files_completed=0,
                                        files_total=1,
                                        bytes_transferred=bytes_downloaded,
                                        bytes_total=total_size,
                                        message=f"Downloading {output_file.name}",
                                    )
                                )

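            # Move to final location. replace() overwrites a stale partial file,
            # whereas rename() raises FileExistsError on Windows in that case.
            temp_file.replace(output_file)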

            # Validate downloaded file
            self._validate_downloaded_file(output_file, url, total_size)

        except Exception:
            if temp_file.exists():
                temp_file.unlink()
            raise

    def _validate_downloaded_file(
        self,
        output_file: Path,
        url: str,
        expected_size: int = 0,
    ) -> None:
        """
        Validate that the downloaded file is valid.

        Parameters
        ----------
        output_file : Path
            Downloaded file to validate.
        url : str
            Original URL (for error messages).
        expected_size : int
            Expected file size in bytes.

        Raises
        ------
        DownloadError
            If file appears to be invalid.
        """
        file_size = output_file.stat().st_size

        # Check for size mismatch
        if expected_size > 0 and file_size != expected_size:
            output_file.unlink()
            raise DownloadError(
                url=url,
                reason=(
                    f"Downloaded file size ({file_size}) does not match "
                    f"expected size ({expected_size}). Download may be incomplete."
                ),
            )

        # Check for suspiciously small files
        if file_size < 10_000:
            with open(output_file, "rb") as f:
                header = f.read(1000)

            if b"<!DOCTYPE" in header or b"<html" in header:
                output_file.unlink()
                raise DownloadError(
                    url=url,
                    reason="Downloaded file is an HTML page, not the expected data file.",
                )

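        # Validate .trk files: the TrackVis header is 1000 bytes long and stores
        # its own size as a little-endian int32 in its last 4 bytes (offset 996)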
        if output_file.suffix == ".trk":
            with open(output_file, "rb") as f:
                f.seek(996)
                hdr_size_bytes = f.read(4)
                if len(hdr_size_bytes) == 4:
                    import struct

                    hdr_size = struct.unpack("<i", hdr_size_bytes)[0]
                    if hdr_size != 1000:
                        output_file.unlink()
                        raise DownloadError(
                            url=url,
                            reason=(
                                f"Invalid .trk file: header size is {hdr_size} "
                                "instead of 1000. File may be corrupted."
                            ),
                        )
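The `.trk` check at the end of `_validate_downloaded_file` can be exercised in isolation. The following is a minimal sketch, not the library's code: `has_valid_trk_header` and `write_fake_trk` are hypothetical helper names, and the dummy files contain only zero padding plus the size field. Unlike the listing above, which skips the check when fewer than 4 bytes can be read, this sketch treats a too-short file as invalid.

```python
import struct
import tempfile
from pathlib import Path

HDR_SIZE_OFFSET = 996   # byte offset read by the listing above
TRK_HEADER_SIZE = 1000  # value the listing above expects at that offset


def has_valid_trk_header(path: Path) -> bool:
    """Return True if the 4 bytes at offset 996 decode to 1000 (little-endian)."""
    with open(path, "rb") as f:
        f.seek(HDR_SIZE_OFFSET)
        raw = f.read(4)
    if len(raw) != 4:
        return False  # assumption: a file too short for a header counts as invalid
    return struct.unpack("<i", raw)[0] == TRK_HEADER_SIZE


def write_fake_trk(path: Path, hdr_size: int) -> None:
    """Write a 1000-byte dummy header whose size field holds hdr_size."""
    path.write_bytes(b"\x00" * HDR_SIZE_OFFSET + struct.pack("<i", hdr_size))


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.trk"
    bad = Path(tmp) / "bad.trk"
    short = Path(tmp) / "short.trk"
    write_fake_trk(good, 1000)
    write_fake_trk(bad, 42)
    short.write_bytes(b"\x00" * 10)
    results = (
        has_valid_trk_header(good),
        has_valid_trk_header(bad),
        has_valid_trk_header(short),
    )

print(results)  # (True, False, False)
```

A check like this only confirms that the header size field is intact; it says nothing about whether the streamline data that follows is complete.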

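`download(output_path, progress_callback=None)`

Download the first file listed for the configured Figshare article, using the authenticated API. If a file with the same name and the expected size already exists in `output_path`, the download is skipped and the existing path is returned.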

Parameters:

Name	Type	Description	Default
`output_path`	`Path`	Directory to download files to.	required
`progress_callback`	`callable`	Function called with FetchProgress updates.	`None`

Returns:

Type	Description
`list[Path]`	List of downloaded file paths (single file for Figshare).

Raises:

Type	Description
`DownloadError`	If download fails or API key is missing.
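A `progress_callback` receives one `FetchProgress` object per update. The sketch below shows one possible callback that turns updates into log lines. The `FetchProgress` class here is a stand-in that declares only the fields used in the source listing; the real class may define more fields or a different field order, so keyword arguments are used throughout. `format_progress` and `collect` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class FetchProgress:
    """Stand-in for the library class; only fields seen in the listing (assumption)."""

    phase: str
    current_file: str
    files_completed: int
    files_total: int
    bytes_transferred: int = 0
    bytes_total: int = 0
    message: str = ""


def format_progress(p: FetchProgress) -> str:
    """Render one update as a single line, preferring byte counts when known."""
    if p.bytes_total > 0:
        pct = 100.0 * p.bytes_transferred / p.bytes_total
        return f"[{p.phase}] {p.current_file}: {pct:.1f}%"
    return f"[{p.phase}] {p.message}"


lines: list[str] = []


def collect(p: FetchProgress) -> None:
    """Callback with the signature expected by download(): FetchProgress -> None."""
    lines.append(format_progress(p))


# Simulate the two kinds of update emitted in the listing: a start message
# without byte counts, then a per-chunk update with byte counts.
collect(
    FetchProgress(
        phase="download",
        current_file="data.trk",
        files_completed=0,
        files_total=1,
        message="Downloading data.trk",
    )
)
collect(
    FetchProgress(
        phase="download",
        current_file="data.trk",
        files_completed=0,
        files_total=1,
        bytes_transferred=512,
        bytes_total=2048,
    )
)
print(lines)  # ['[download] Downloading data.trk', '[download] data.trk: 25.0%']
```

Note that in the source listing, passing any callback disables the built-in `tqdm` bar (`disable=progress_callback is not None`), so the callback becomes the only source of progress output.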


`GithubReleaseDownloader`

Bases: BaseDownloader

Downloader for files hosted on GitHub Releases.

No authentication is required — files are downloaded via plain HTTP GET.

Parameters:

Name	Type	Description	Default
`source`	`ConnectomeSource`	Configuration for the connectome source. Must have `download_url` set.	required
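The output filename is taken from the last path segment of `download_url`, so the URL must end in a file name. The following is a minimal standalone sketch of that extraction, assuming the same `urlparse`/`unquote` approach as the class source; `filename_from_url` is a hypothetical name, the URLs are made up, and `ValueError` stands in for the library's `DownloadError`.

```python
from urllib.parse import unquote, urlparse


def filename_from_url(url: str) -> str:
    """Return the percent-decoded last path segment of a URL."""
    path = unquote(urlparse(url).path)  # query string and fragment are excluded
    filename = path.split("/")[-1]
    if not filename:
        # Stand-in for DownloadError in the library
        raise ValueError(f"Could not extract filename from URL: {url}")
    return filename


# Hypothetical release asset URL with an encoded space and a query string
name = filename_from_url(
    "https://github.com/example-org/example-repo/releases/download/v1.0/hcp%20tracts.trk?raw=1"
)
print(name)  # hcp tracts.trk

# A URL ending in "/" has an empty last segment and is rejected
try:
    filename_from_url("https://example.com/releases/")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```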

Source code in src/lacuna/io/downloaders/github.py

class GithubReleaseDownloader(BaseDownloader):
    """
    Downloader for files hosted on GitHub Releases.

    No authentication is required — files are downloaded via plain HTTP GET.

    Parameters
    ----------
    source : ConnectomeSource
        Configuration for the connectome source. Must have ``download_url`` set.
    """

    def __init__(self, source: ConnectomeSource):
        super().__init__(source)

    def download(
        self,
        output_path: Path,
        progress_callback: Callable[[FetchProgress], None] | None = None,
    ) -> list[Path]:
        """
        Download file from GitHub Releases.

        Parameters
        ----------
        output_path : Path
            Directory to download files to.
        progress_callback : callable, optional
            Function called with FetchProgress updates.

        Returns
        -------
        list[Path]
            List of downloaded file paths (single file).

        Raises
        ------
        DownloadError
            If download fails or download_url is not configured.
        """
        if not self.source.download_url:
            raise DownloadError(
                url="",
                reason="No download_url configured for GitHub source",
            )

        output_path = Path(output_path)
        output_path.mkdir(parents=True, exist_ok=True)

        # Extract filename from URL
        filename = self._get_filename_from_url(self.source.download_url)
        output_file = output_path / filename

        # Skip if already exists
        if output_file.exists():
            if progress_callback:
                progress_callback(
                    FetchProgress(
                        phase="download",
                        current_file=filename,
                        files_completed=1,
                        files_total=1,
                        message=f"Already downloaded: {filename}",
                    )
                )
            return [output_file]

        # Report progress
        if progress_callback:
            progress_callback(
                FetchProgress(
                    phase="download",
                    current_file=filename,
                    files_completed=0,
                    files_total=1,
                    message=f"Downloading {filename}",
                )
            )

        # Download file
        self._download_file(
            url=self.source.download_url,
            output_file=output_file,
            progress_callback=progress_callback,
        )

        return [output_file]

    def _get_filename_from_url(self, url: str) -> str:
        """Extract filename from URL path."""
        parsed = urlparse(url)
        path = unquote(parsed.path)
        filename = path.split("/")[-1]
        if not filename:
            raise DownloadError(url=url, reason="Could not extract filename from URL")
        return filename

    def _download_file(
        self,
        url: str,
        output_file: Path,
        progress_callback: Callable[[FetchProgress], None] | None = None,
    ) -> None:
        """
        Download a single file via HTTP GET.

        Parameters
        ----------
        url : str
            Download URL.
        output_file : Path
            Output file path.
        progress_callback : callable, optional
            Progress callback function.

        Raises
        ------
        DownloadError
            If download fails.
        """
        try:
            response = requests.get(url, stream=True, timeout=60)
            response.raise_for_status()
        except requests.exceptions.HTTPError as e:
            raise DownloadError(
                url=url,
                reason=f"Download failed: HTTP {e.response.status_code}",
            ) from e
        except Exception as e:
            raise DownloadError(url=url, reason=str(e)) from e

        total_size = int(response.headers.get("content-length", 0))

        # Check for HTML response
        content_type = response.headers.get("content-type", "")
        if "text/html" in content_type.lower():
            raise DownloadError(
                url=url,
                reason="Received HTML instead of file data. The URL may be invalid.",
            )

        # Use temp file for atomic write
        temp_file = output_file.with_suffix(output_file.suffix + ".tmp")

        try:
            with open(temp_file, "wb") as f:
                with tqdm(
                    total=total_size,
                    unit="B",
                    unit_scale=True,
                    desc=output_file.name,
                    disable=progress_callback is not None,
                ) as pbar:
                    bytes_downloaded = 0
                    for chunk in response.iter_content(chunk_size=1024 * 1024):
                        if chunk:
                            f.write(chunk)
                            bytes_downloaded += len(chunk)
                            pbar.update(len(chunk))

                            if progress_callback:
                                progress_callback(
                                    FetchProgress(
                                        phase="download",
                                        current_file=output_file.name,
                                        files_completed=0,
                                        files_total=1,
                                        bytes_transferred=bytes_downloaded,
                                        bytes_total=total_size,
                                        message=f"Downloading {output_file.name}",
                                    )
                                )

            # Move to final location
            temp_file.rename(output_file)

        except Exception:
            if temp_file.exists():
                temp_file.unlink()
            raise
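The temp-file pattern used by `_download_file` can be shown without any network access. This is a minimal sketch under stated assumptions: `write_atomically` and `failing_chunks` are hypothetical names, the chunks come from an in-memory iterable instead of `response.iter_content`, and the final move uses `Path.replace` (which overwrites an existing destination on every platform) where the listing above uses `rename`.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_atomically(chunks: Iterable[bytes], output_file: Path) -> int:
    """Stream chunks to '<name>.tmp', then move it into place; return bytes written."""
    temp_file = output_file.with_suffix(output_file.suffix + ".tmp")
    written = 0
    try:
        with open(temp_file, "wb") as f:
            for chunk in chunks:
                if chunk:  # skip empty chunks, as the listing does
                    f.write(chunk)
                    written += len(chunk)
        temp_file.replace(output_file)  # final path appears only after a full write
    except Exception:
        if temp_file.exists():
            temp_file.unlink()  # never leave a partial file behind
        raise
    return written


def failing_chunks():
    """Yield one chunk, then fail, to simulate a dropped connection."""
    yield b"partial"
    raise ConnectionError("simulated network drop")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "connectome.trk"
    n = write_atomically([b"abc", b"", b"defg"], target)
    ok_content = target.read_bytes()

    broken = Path(tmp) / "broken.trk"
    try:
        write_atomically(failing_chunks(), broken)
    except ConnectionError:
        pass
    leftovers = sorted(p.name for p in Path(tmp).iterdir())

print(n, ok_content, leftovers)  # 7 b'abcdefg' ['connectome.trk']
```

Because the destination path only ever holds a completely written file, an "already exists" check like the one in `download` is less likely to mistake an interrupted download for a finished one.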

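`download(output_path, progress_callback=None)`

Download the file at `source.download_url` from GitHub Releases. If a file with the same name already exists in `output_path`, it is returned without being downloaded again; no size check is performed.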

Parameters:

Name	Type	Description	Default
`output_path`	`Path`	Directory to download files to.	required
`progress_callback`	`callable`	Function called with FetchProgress updates.	`None`

Returns:

Type	Description
`list[Path]`	List of downloaded file paths (single file).

Raises:

Type	Description
`DownloadError`	If download fails or download_url is not configured.


`get_api_key(cli_key=None)`

Resolve the API key in priority order: the CLI argument first, then the `DATAVERSE_API_KEY` environment variable, then the config file.

Parameters:

Name	Type	Description	Default
`cli_key`	`str`	API key provided via CLI argument.	`None`

Returns:

Type	Description
`str or None`	The API key, or None if not found.
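The priority order can be illustrated with a small standalone function. This is a sketch only: `resolve_api_key` is a hypothetical name, and the `env` and `load_config` parameters are injected here purely so the example runs without touching the real environment or any config file. How the library's `_load_config_file_key` reads its file is not shown in this page and is not assumed.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_api_key(
    cli_key: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
    load_config: Callable[[], Optional[str]] = lambda: None,
) -> Optional[str]:
    """Return the first key found: CLI argument, then env var, then config."""
    if cli_key:
        return cli_key
    if key := env.get("DATAVERSE_API_KEY"):
        return key
    return load_config()  # hypothetical stand-in for the config-file lookup


fake_env = {"DATAVERSE_API_KEY": "from-env"}

from_cli = resolve_api_key("from-cli", env=fake_env, load_config=lambda: "from-config")
from_env = resolve_api_key(None, env=fake_env, load_config=lambda: "from-config")
from_config = resolve_api_key(None, env={}, load_config=lambda: "from-config")
nothing = resolve_api_key("", env={})  # an empty string is treated as "not provided"

print(from_cli, from_env, from_config, nothing)  # from-cli from-env from-config None
```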

Source code in src/lacuna/io/downloaders/base.py

def get_api_key(cli_key: str | None = None) -> str | None:
    """
    Get API key using priority order: CLI > env var > config file.

    Parameters
    ----------
    cli_key : str, optional
        API key provided via CLI argument.

    Returns
    -------
    str or None
        The API key, or None if not found.
    """
    if cli_key:
        return cli_key
    if key := os.environ.get("DATAVERSE_API_KEY"):
        return key
    return _load_config_file_key()

