Remote

FTPRemoteResource dataclass

Bases: RemoteResource

Source code in bionemo/llm/utils/remote.py
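from urllib import request  # assumed import: `request` below is taken to be urllib.request


class FTPRemoteResource(RemoteResource):  # noqa: D101
    def download_resource(self, overwrite=False) -> str:
        """Downloads the resource to its fully qualified destination filename.

        Returns: the fully qualified destination filename.
        """
        # Make sure the destination folder exists before writing into it.
        self.exists_or_create_destination_directory()

        # Fetch only when the file is missing, fails its checksum, or a re-download is forced.
        if not self.check_exists() or overwrite:
            request.urlretrieve(self.url, self.fully_qualified_dest_filename)

        # Re-check the file; note that the boolean result is not acted on here.
        self.check_exists()
        return self.fully_qualified_dest_filename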

download_resource(overwrite=False)

Downloads the resource to its fully qualified destination filename (fully_qualified_dest_filename).

Returns: the fully qualified destination filename.
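
As a rough illustration of the retrieval step, the sketch below uses urllib.request.urlretrieve, the call that the listing above appears to rely on. The helper name fetch_with_urlretrieve and the file:// URL are assumptions made so the example runs without a network; an ftp:// URL would be passed in the same way.

```python
import tempfile
from pathlib import Path
from urllib import request


def fetch_with_urlretrieve(url: str, dest: Path, overwrite: bool = False) -> str:
    """Copy `url` to `dest` unless `dest` already exists (hypothetical helper)."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if overwrite or not dest.exists():
        # urlretrieve handles the schemes urllib supports, including ftp:// and file://.
        request.urlretrieve(url, str(dest))
    return str(dest)


# Stand-in "remote" file exposed through a file:// URL (assumption for the demo).
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.txt"
source.write_bytes(b"ACGT\n")

downloaded = fetch_with_urlretrieve(source.as_uri(), workdir / "cache" / "copy.txt")
```

Unlike the class method, this helper only checks that the destination exists; it does not verify a checksum.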


RemoteResource dataclass

Responsible for downloading remote files, along with optional processing of downloaded files for downstream use cases.

Each object is created either through its constructor (which sets up the destination and checksum) or through a pre-configured class method. download_resource() contains the core functionality: it downloads the file at url to the fully qualified filename. Class methods can be used to further configure this process.

Receive

a file, its checksum, a destination directory, and a root directory

Our dataclass then provides some useful things:
- fully qualified destination folder (property)
- fully qualified destination file (property)
- check_exists()
- download_resource()

Form the fully qualified destination folder. Create a fully qualified path for the file.

(all lives in the download routine)
- Check that the fully qualified destination folder exists, otherwise create it.
- Download the file.
- Checksum the download.
- Done.

Postprocessing should be its own method with its own configuration.
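
The download routine described above can be sketched with the standard library alone. This is a minimal illustration, not the class's implementation: download_and_verify and its fetch callable are hypothetical names, and fetch stands in for the real network call.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, Optional


def download_and_verify(
    fetch: Callable[[], bytes],
    root_directory: Path,
    dest_directory: str,
    dest_filename: str,
    checksum: Optional[str] = None,
) -> Path:
    """Create the folder, fetch the payload, then checksum it (illustrative only)."""
    # 1. Form the fully qualified destination folder and file path.
    folder = Path(root_directory) / dest_directory
    target = folder / dest_filename
    # 2. Check that the folder exists, otherwise create it.
    folder.mkdir(parents=True, exist_ok=True)
    # 3. Download the file; `fetch` is a stand-in for the real transfer.
    target.write_bytes(fetch())
    # 4. Checksum the download; this sketch chooses to raise on a mismatch.
    if checksum is not None:
        digest = hashlib.md5(target.read_bytes()).hexdigest()
        if digest != checksum:
            raise ValueError(f"checksum mismatch for {target}: {digest}")
    return target


payload = b"example payload"
path = download_and_verify(
    fetch=lambda: payload,
    root_directory=Path(tempfile.mkdtemp()),
    dest_directory="genomes",
    dest_filename="example.bin",
    checksum=hashlib.md5(payload).hexdigest(),
)
```

Raising on a mismatch is a design choice of this sketch; whether a failed checksum should abort or merely be logged depends on the caller.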

Example usage

The following will download and preprocess the prepackaged resources.

GRCh38Ensembl99ResourcePreparer().prepare()
Hg38chromResourcePreparer().prepare()
GRCh38p13_ResourcePreparer().prepare()

Attributes:

dest_directory (str): The directory in which to place the file once the download completes. The file is written to {dest_directory}/{dest_filename}.

dest_filename (str): The desired name for the file once the download completes.

checksum (Optional[str]): MD5 checksum of the file located at url. If set to None, check_exists only checks for the existence of {dest_directory}/{dest_filename}.

url (Optional[str]): URL of the file to download.

root_directory (str | PathLike): The base directory. The fully qualified path is formed by joining root_directory, dest_directory, and dest_filename.
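
To make the path construction concrete, here is a small sketch of how the three attributes combine. The directory and file names are hypothetical values chosen for illustration.

```python
import os
from pathlib import Path

# Hypothetical attribute values.
root_directory = "/tmp/bionemo_cache"
dest_directory = "genomes"
dest_filename = "file.tar.gz"

# Folder: root_directory joined with dest_directory (a Path object).
folder = Path(root_directory) / dest_directory
# File: the folder joined with dest_filename (os.path.join returns a str).
filename = os.path.join(folder, dest_filename)
```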

Source code in bionemo/llm/utils/remote.py
@dataclass
class RemoteResource:
    """Responsible for downloading remote files, along with optional processing of downloaded files for downstream use cases.

    Each object is invoked through either its constructor (setting up the destination and checksum), or through a pre-configured class method.
    `download_resource()` contains the core functionality, which is to download the file at `url` to the fully qualified filename. Class methods
    can be used to further configure this process.

    Receive:
        a file, its checksum, a destination directory, and a root directory

        Our dataclass then provides some useful things:
            - fully qualified destination folder (property)
            - fully qualified destination file (property)
            - check_exists()
            - download_resource()

        Form the fully qualified destination folder.
        Create a fully qualified path for the file

        (all lives in the download routine)
        Check that the fq destination folder exists, otherwise create it
        Download the file.
        Checksum the download.
        Done.

        Postprocessing should be its own method with its own configuration.

    Example usage:
        >>> # The following will download and preprocess the prepackaged resources.
        >>> GRCh38Ensembl99ResourcePreparer().prepare()
        >>> Hg38chromResourcePreparer().prepare()
        >>> GRCh38p13_ResourcePreparer().prepare()


    Attributes:
        dest_directory: The directory in which to place the file once the download completes. The file is written to {dest_directory}/{dest_filename}.
        dest_filename: The desired name for the file once the download completes.
        checksum: MD5 checksum of the file located at url. If set to None, check_exists only checks for the existence of `{dest_directory}/{dest_filename}`.
        url: URL of the file to download.
        root_directory: The base directory. The fully qualified path is formed by joining root_directory, dest_directory, and dest_filename.
    """

    checksum: Optional[str]
    dest_filename: str
    dest_directory: str
    root_directory: str | os.PathLike = BIONEMO_CACHE_DIR
    url: Optional[str] = None

    @property
    def fully_qualified_dest_folder(self):  # noqa: D102
        return Path(self.root_directory) / self.dest_directory

    @property
    def fully_qualified_dest_filename(self):
        """Returns the fully qualified destination path of the file.

        Example:
            /tmp/my_folder/file.tar.gz
        """
        return os.path.join(self.fully_qualified_dest_folder, self.dest_filename)

    def exists_or_create_destination_directory(self, exist_ok=True):
        """Ensures that `fully_qualified_dest_folder` exists, creating it (and any missing parents) if it does not.

        exist_ok: Passed through to `os.makedirs`. If False, a FileExistsError is raised when the folder already exists.
        """
        os.makedirs(self.fully_qualified_dest_folder, exist_ok=exist_ok)

    @staticmethod
    def get_env_tmpdir():
        """Convenience method that exposes the environment TMPDIR variable."""
        return os.environ.get("TMPDIR", "/tmp")

    def download_resource(self, overwrite=False) -> str:
        """Downloads the resource to its specified fully_qualified_dest name.

        Returns: the fully qualified destination filename.
        """
        self.exists_or_create_destination_directory()

        if not self.check_exists() or overwrite:
            logging.info(f"Downloading resource: {self.url}")
            with requests.get(self.url, stream=True) as r, open(self.fully_qualified_dest_filename, "wb") as fd:
                r.raise_for_status()
                # Iterating over the response yields the body in chunks; avoid shadowing the builtin `bytes`.
                for chunk in r:
                    fd.write(chunk)
        else:
            logging.info(f"Resource already exists, skipping download: {self.url}")

        # Re-check the file; note that the boolean result is not acted on here.
        self.check_exists()
        return self.fully_qualified_dest_filename

    def check_exists(self):
        """Returns True if `fully_qualified_dest_filename` exists and its MD5 checksum matches `self.checksum` (or no checksum was provided)"""  # noqa: D415
        if os.path.exists(self.fully_qualified_dest_filename):
            with open(self.fully_qualified_dest_filename, "rb") as fd:
                data = fd.read()
                result = md5(data).hexdigest()
            if self.checksum is None:
                logging.info("No checksum provided, filename exists. Assuming it is complete.")
                matches = True
            else:
                matches = result == self.checksum
            return matches

        return False

fully_qualified_dest_filename property

Returns the fully qualified destination path of the file.

Example

/tmp/my_folder/file.tar.gz

check_exists()

Returns True if fully_qualified_dest_filename exists and its MD5 checksum matches self.checksum. If checksum is None, only the file's existence is checked.
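
The class source reads the whole file into memory before hashing. For large downloads, a chunked variant may be preferable. The sketch below is one way to do that under the same return-value rules; file_matches_md5 and chunk_size are hypothetical names, not part of the library.

```python
import hashlib
import os
import tempfile
from typing import Optional


def file_matches_md5(path: str, expected: Optional[str], chunk_size: int = 1 << 20) -> bool:
    """Return True if `path` exists and matches `expected` (or `expected` is None)."""
    if not os.path.exists(path):
        return False
    if expected is None:
        # No checksum supplied: existence is treated as sufficient.
        return True
    digest = hashlib.md5()
    with open(path, "rb") as fd:
        # Feed the hash one chunk at a time instead of reading the whole file.
        for chunk in iter(lambda: fd.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected


# Write a small sample file to check against.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
sample_path = tmp.name
```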


download_resource(overwrite=False)

Downloads the resource to its fully qualified destination filename (fully_qualified_dest_filename).

Returns: the fully qualified destination filename.
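
The class source streams the HTTP response to disk in chunks rather than buffering it whole. The sketch below shows the same pattern against an in-memory stream so that it runs offline. stream_to_file is a hypothetical helper; with the requests library, one would typically iterate response.iter_content(chunk_size=...) instead of calling read().

```python
import io
import tempfile
from pathlib import Path
from typing import BinaryIO


def stream_to_file(source: BinaryIO, dest: Path, chunk_size: int = 8192) -> int:
    """Copy `source` to `dest` in fixed-size chunks and return the bytes written."""
    written = 0
    with open(dest, "wb") as fd:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:  # an empty read signals end of stream
                break
            fd.write(chunk)
            written += len(chunk)
    return written


# An in-memory stream stands in for the HTTP response body.
fake_response = io.BytesIO(b"x" * 20000)
dest = Path(tempfile.mkdtemp()) / "payload.bin"
total = stream_to_file(fake_response, dest)
```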


exists_or_create_destination_directory(exist_ok=True)

Ensures that fully_qualified_dest_folder exists, creating it (and any missing parents) if it does not.

exist_ok: Passed through to os.makedirs. If False, a FileExistsError is raised when the folder already exists.
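
The effect of exist_ok can be seen directly with os.makedirs. The directory names below are arbitrary and live in a temporary folder.

```python
import os
import tempfile

base = tempfile.mkdtemp()
target = os.path.join(base, "a", "b")

os.makedirs(target, exist_ok=True)  # creates a/ and a/b
os.makedirs(target, exist_ok=True)  # no error: the folder is already present

try:
    os.makedirs(target, exist_ok=False)
    raised = False
except FileExistsError:
    # With exist_ok=False, an existing folder is treated as an error.
    raised = True
```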


get_env_tmpdir() staticmethod

Convenience method that exposes the TMPDIR environment variable, falling back to /tmp when it is not set.
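
A minimal sketch of the same lookup, written so the environment can be injected for testing. get_tmpdir and its env parameter are hypothetical. The standard library's tempfile.gettempdir() is a more portable alternative, since it also consults TEMP and TMP.

```python
import os
import tempfile
from typing import Mapping, Optional


def get_tmpdir(env: Optional[Mapping[str, str]] = None) -> str:
    """Return TMPDIR from `env` (defaults to os.environ), or "/tmp" if unset."""
    env = os.environ if env is None else env
    return env.get("TMPDIR", "/tmp")


from_env = get_tmpdir({"TMPDIR": "/scratch/tmp"})
fallback = get_tmpdir({})
portable = tempfile.gettempdir()  # platform-aware alternative
```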
