ScraperLib#
- class src.ScraperLib.DownloadState(state_file: str = 'download_state.json', incremental: bool = True)[source]#
Bases:
object
Manages the persistent state of downloads, including completed, failed, and delay statistics.
This class provides atomic file operations for safe concurrent access, tracks download progress, and maintains statistics for reporting and incremental downloads.
- Parameters:
state_file (str) – Path to the state file (JSON).
incremental (bool) – If True, loads existing state; otherwise, starts fresh.
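The documented workflow (check whether a URL is done, download, record the outcome, persist) can be illustrated with a simplified, in-memory stand-in. This is a sketch of the pattern only, not the library's implementation; `MiniState` and its internals are hypothetical.

```python
import hashlib
import json
import os


class MiniState:
    """Simplified stand-in for DownloadState (illustrative only)."""

    def __init__(self, state_file: str = "download_state.json", incremental: bool = True):
        self.state_file = state_file
        self._state = {"completed": {}, "failed": {}}
        # Mirror the documented incremental behavior: load prior state if present.
        if incremental and os.path.exists(state_file):
            with open(state_file) as f:
                self._state = json.load(f)

    def _file_id(self, url: str) -> str:
        # Unique per-URL key, as get_file_id is documented to return an MD5 hash.
        return hashlib.md5(url.encode("utf-8")).hexdigest()

    def add_completed(self, url: str, filepath: str, size: int) -> None:
        self._state["completed"][self._file_id(url)] = {
            "url": url, "filepath": filepath, "size": size,
        }

    def add_failed(self, url: str, error) -> None:
        self._state["failed"][self._file_id(url)] = {"url": url, "error": str(error)}

    def is_completed(self, url: str) -> bool:
        return self._file_id(url) in self._state["completed"]
```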
- add_completed(url: str, filepath: str, size: int) None [source]#
Mark a file as successfully downloaded.
- Parameters:
url (str) – File URL.
filepath (str) – Local file path.
size (int) – File size in bytes.
- add_delay(delay: float, success: bool = True) None [source]#
Record a new delay and update statistics.
- Parameters:
delay (float) – Delay value in seconds.
success (bool) – True if the delay was for a successful download.
- add_failed(url: str, error: Any) None [source]#
Mark a file as failed to download.
- Parameters:
url (str) – File URL.
error (Any) – Error message or exception.
- generate() Dict[str, Any] [source]#
Initialize a fresh state structure and save it to disk.
- Returns:
The initialized state dictionary.
- Return type:
dict
- get_file_id(url: str) str [source]#
Generate a unique file ID for a given URL.
- Parameters:
url (str) – File URL.
- Returns:
MD5 hash of the URL.
- Return type:
str
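Per the description above, the file ID is the MD5 hex digest of the URL string; an equivalent stand-alone computation (the helper name `file_id` is illustrative):

```python
import hashlib


def file_id(url: str) -> str:
    # MD5 hex digest of the URL string, matching get_file_id's documented return.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


print(len(file_id("https://example.com/data.csv")))  # 32 hex characters
```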
- is_completed(url: str, data_dir: str | None = None) bool [source]#
Check if a file has already been downloaded and exists on disk.
- Parameters:
url (str) – File URL.
data_dir (str, optional) – Directory where files are stored.
- Returns:
True if completed and file exists, False otherwise.
- Return type:
bool
- load_state() None [source]#
Load state from disk into memory.
Reads the JSON from the configured state_file path and populates the internal cache. If the file does not exist or contains invalid JSON, a fresh state is generated and saved.
- Returns:
None
- Raises:
IOError – If the state file cannot be opened or read.
- save_state() None [source]#
Save the current state to disk.
Updates the 'last_update' timestamp and writes the in-memory state cache to the state file using atomic (locked) file operations for safety.
- Returns:
None
- Raises:
IOError – If the state file cannot be written due to I/O errors.
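The "atomic (locked) file operations" mentioned above can be approximated with a write-to-temp-then-replace pattern. This is a sketch of that idea, not the library's actual code; `atomic_save` is a hypothetical helper.

```python
import json
import os
import tempfile


def atomic_save(state: dict, path: str) -> None:
    # Write the JSON to a temporary file in the target directory, then
    # os.replace() it into place so readers never observe a partial file.
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp)
        raise
```

Writing to a temp file in the same directory (rather than a system temp dir) matters: `os.replace` is only atomic within a single filesystem.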
- property state: Dict[str, Any]#
Get the current state dictionary.
- Returns:
The current state.
- Return type:
dict
- class src.ScraperLib.ScraperLib(base_url: str, file_patterns: List[str], download_dir: str = 'downloads', state_file: str = 'state/download_state.json', log_file: str = 'logs/scraper_log.log', output_dir: str = 'output', incremental: bool = True, max_files: int | None = None, max_concurrent: int | None = None, headers: Dict[str, str] | None = None, user_agents: List[str] | None = None, report_prefix: str = 'download_report', disable_logging: bool = False, disable_terminal_logging: bool = False, dataset_name: str | None = None, disable_progress_bar: bool = False, max_old_logs: int = 10, max_old_runs: int = 10, ray_instance: Any | None = None, chunk_size: str | int = '5MB', initial_delay: float = 1.0, max_delay: float = 60.0, max_retries: int = 5)[source]#
Bases:
object
Library for parallel file extraction and download, report generation, and log/result rotation.
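The constructor's `initial_delay`, `max_delay`, and `max_retries` parameters suggest a capped exponential backoff between retries. A sketch of such a schedule (the doubling factor is an assumption; the library may use a different formula or add jitter):

```python
def backoff_delays(initial_delay: float = 1.0, max_delay: float = 60.0,
                   max_retries: int = 5):
    # Yield one delay per retry, doubling each time but never exceeding max_delay.
    delay = initial_delay
    for _ in range(max_retries):
        yield delay
        delay = min(delay * 2, max_delay)


print(list(backoff_delays()))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```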
- static cli() None [source]#
Command-line interface to run ScraperLib.
Reads arguments from the terminal, instantiates ScraperLib, and runs the process.
- Usage example:
python -m scraper_lib.cli --url <URL> --patterns .csv .zip --dir data --max-files 10
- Parameters:
--url – Base URL to scrape for files.
--patterns – List of file patterns to match (e.g. .csv .zip).
--dir – Download directory.
--incremental – Enable incremental download state.
--max-files – Limit number of files to download.
--max-concurrent – Max parallel downloads.
--chunk-size – Chunk size for downloads (e.g. 1gb, 10mb, 8 bytes).
--initial-delay – Initial delay between retries (seconds).
--max-delay – Maximum delay between retries (seconds).
--max-retries – Maximum number of download retries.
--state-file – Path for download state file.
--log-file – Path for main log file.
--report-prefix – Prefix for report files.
--headers – Path to JSON file with custom headers.
--user-agents – Path to text file with custom user agents (one per line).
--disable-logging – Disable all logging for production pipelines.
--disable-terminal-logging – Disable terminal logging.
--dataset-name – Dataset name for banner.
--disable-progress-bar – Disable tqdm progress bar.
--output-dir – Directory for report PNGs and JSON.
--max-old-logs – Max old log files to keep (default: 10; None disables rotation).
--max-old-runs – Max old report/png runs to keep (default: 10; None disables rotation).
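The --chunk-size flag accepts human-readable sizes such as "1gb", "10mb", or "8 bytes". A hypothetical parser illustrating how such strings could map to byte counts (`parse_chunk_size` is not a function the library exposes; this only demonstrates the documented formats):

```python
import re


def parse_chunk_size(value) -> int:
    # Convert '5MB', '1gb', '8 bytes', or a plain int into a byte count.
    if isinstance(value, int):
        return value
    m = re.fullmatch(r"\s*(\d+)\s*(bytes|[kmg]?b)?\s*", value.lower())
    if not m:
        raise ValueError(f"invalid chunk size: {value!r}")
    factor = {None: 1, "b": 1, "bytes": 1,
              "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}[m.group(2)]
    return int(m.group(1)) * factor


print(parse_chunk_size("5MB"))  # 5242880
```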