
Dataset

This article introduces the usage of the Bohrium dataset.

What is Dataset?

The Bohrium dataset feature provides data import, download, version management, data sharing, and dataset mounting. Have you ever run into any of the following problems?

  • Most of the input files for jobs are the same, but every time you submit a job, you have to wait for the file packaging and uploading process, resulting in low job submission efficiency.
  • Input files are large, and the packaging and uploading process takes a long time when submitting jobs.
  • You have files that you want to share with others, but you don't know how to share them.
  • ...

Datasets address these problems: they improve job submission efficiency and meet data sharing needs.

Creating a Dataset

Click the "Dataset" button in the main menu to enter the dataset list page, as shown in Figure 1. Click the "New Dataset" button, as shown in Figure 2, to enter the dataset creation page.

[Image: Create dataset]

Fill in the dataset's basic information and upload files. When you click "Create", the system creates version v1 of the dataset by default. See the "Dataset content description" section of this article for an explanation of each field.

[Image]

Once the information and files are ready, click "Create". After the dataset is created, the page automatically redirects to the details page of this dataset version.

Viewing a Dataset

Click the "Dataset" button in the main menu to enter the dataset list page. The list displays all the datasets you can use, including the datasets you created and the datasets others have created and shared with you.

[Image]

Click the dataset name to enter the dataset details page, where you can view the dataset's basic information and the information for each version, obtain the file path of each version, and view and download version files.

[Image]

Editing and Version Management of a Dataset

If you have management permissions for a dataset, you can add new versions, delete the dataset or its versions, and edit the dataset's basic information.

Version Management

If you need to change the files in a dataset, release a new version using the "New Version" button.

  1. Creating: Click the "New Version" button to enter the new dataset version creation page. The system will automatically import the existing files from the latest published version. You can add or delete files as needed, and click "Create" to release the new version.

    [Image]

    [Image]

  2. Waiting for preparation to complete: Creating a new version requires some preparation time, which depends on the number and size of the files in the version. During preparation, other users cannot see or use this version. Wait until the version is ready before using it.

  3. After a version is created, the files within the version cannot be changed. If adjustments are needed, you can create a new version.

All the versions you have released are displayed in the dataset. You can add and delete dataset versions as needed. Other users can see only the versions that you have successfully published.

Notice: Deleted versions cannot be restored and will no longer be viewable or usable.

Editing a Dataset

  • Click the "Edit" button on the dataset list page or dataset details page to modify the dataset's name, description, and permission scope.

    [Image]

  • On the dataset details page, you can also modify the description of each version.

    [Image]

Using a Dataset

Datasets can currently be used in the following scenarios:

Submitting a job

  1. Command line submission: Add a dataset_path field to your job.json. Its value is an array containing the path of each dataset version you need, as highlighted by the red box in the image below.

[Image]

Specifying an input file directory when submitting a job is still supported, and the two methods can be used together.

Here is an example job.json (the project_id value 0000 is a placeholder; replace it with your own project ID):

{
  "job_name": "DeePMD-kit test",
  "command": " cd se_e2_a && dp train input.json > tmp_log 2>&1 && dp freeze -o graph.pb",
  "log_file": "se_e2_a/tmp_log",
  "backward_files": ["se_e2_a/lcurve.out", "se_e2_a/graph.pb"],
  "project_id": 0000,
  "platform": "ali",
  "machine_type": "c4_m15_1 * NVIDIA T4",
  "job_type": "container",
  "image_address": "registry.dp.tech/dptech/deepmd-kit:2.1.5-cuda11.6",
  "dataset_path": ["/bohr/test1-51ov/v1", "/bohr/test1-51ov/v2"]
}

  2. Web submission:

When submitting a job in the graphical interface, click the 'Select Dataset' button and choose the dataset version you wish to use.

[Image]

[Image]
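
For command line submission, the dataset_path field can also be added to an existing job.json with a short script instead of by hand. The following is a minimal sketch, not part of the Bohrium tooling: the function name add_dataset_paths is hypothetical, and it assumes the file is valid JSON (a placeholder such as 0000 for project_id must be replaced with a real number first, because JSON does not allow leading zeros).

```python
import json
from pathlib import Path


def add_dataset_paths(job_file, dataset_paths):
    """Add dataset version paths to the dataset_path array of a job.json file.

    Hypothetical helper for illustration; only the dataset_path field name
    comes from the documentation above.
    """
    job_path = Path(job_file)
    job = json.loads(job_path.read_text(encoding="utf-8"))

    # Keep any paths already listed and append new ones without duplicates.
    merged = list(job.get("dataset_path", []))
    for path in dataset_paths:
        if path not in merged:
            merged.append(path)

    job["dataset_path"] = merged
    job_path.write_text(json.dumps(job, indent=2) + "\n", encoding="utf-8")
    return job


# Example call (assumes a job.json exists in the current directory):
# add_dataset_paths("job.json", ["/bohr/test1-51ov/v1", "/bohr/test1-51ov/v2"])
```

All other fields in the file are left unchanged, so an input file directory specified elsewhere in the job continues to work alongside the datasets.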

Use and share datasets in Notebook

When writing and posting a Notebook, you can use the datasets it requires and share them along with it.

Step 1: Select the dataset you want to use/share

On the Bohrium homepage, click the 'New - Notebook' button at the top left corner to enter the Notebook editing page.

[Image]

Click the arrow on the right side to expand the extension panel.

[Image]

Click on "Select Existing Datasets" to add the dataset version required for this notebook. You can also click on "New Dataset" to create a new dataset.

[Image]

Notice: Please add the dataset before connecting to the node. Datasets added after the node has started will require a node restart to take effect.

Step 2: Use the dataset in the Notebook

Move the mouse over the selected dataset name and click the copy button to get the storage path of the dataset files. All dataset files are stored in this path.

[Image]

Simply enter this path in the Notebook to use it. The path used in the example below is: /bohr/testdataset-6xwt/v1/

Example 1: Enter the dataset directory

cd /bohr/testdataset-6xwt/v1/

Example 2: List all files under the dataset

ls /bohr/testdataset-6xwt/v1/
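
The same listing can be done from a Python cell in the Notebook. This is a minimal sketch using only the standard library; the function name list_dataset_files is hypothetical, and the path is the example path used above.

```python
from pathlib import Path


def list_dataset_files(dataset_dir):
    """Return the relative paths of all files under a dataset directory, sorted."""
    root = Path(dataset_dir)
    if not root.is_dir():
        # The directory is missing, for example because the dataset was
        # added after the node started and the node has not been restarted.
        raise FileNotFoundError(f"Dataset directory not found: {root}")
    # Walk the directory tree recursively and keep regular files only.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


# Example call with the path from this article:
# for name in list_dataset_files("/bohr/testdataset-6xwt/v1/"):
#     print(name)
```

Checking that the directory exists first gives a clearer error than a failed read later in the Notebook.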

Step 3: Post the Notebook and share the dataset

After the Notebook with the added dataset is posted, other users can view and use the corresponding dataset on the details page.

[Image]

Use the dataset on the management node

When you create a container management node, you can add the dataset versions you need to mount, as shown at '1' in the figure below. After the dataset has been mounted and the node has booted, you can find the dataset files on the management node at the path shown at '2'.

[Image]

Dataset content description

| Field Name | Description | Example |
| --- | --- | --- |
| Dataset Name | The name of the dataset, which can be modified at any time. | testdataset |
| Dataset Path | The path to which the dataset files are uploaded. Enter a recognizable identifier for the dataset in the input box, and the system automatically generates a unique path for the version. Notice: Modifying the path after uploading files clears the files you have already uploaded, so modify it with caution. | /bohr/testdataset-b2dh/v1 |
| Files | The files included in this dataset version. Local files or folders can be uploaded. Notice: Do not refresh or leave the page during file upload, to avoid upload failure. | -- |
| Project | The project to which the dataset belongs. Project members can use the dataset by default. | testproject |
| Permissions | Manageable: permission to edit the dataset, delete it, create new versions, and so on. The dataset creator and the creator and administrator of the project to which the dataset belongs have this permission by default, and it cannot be changed. Usable: permission to view and use the dataset. Members of the project to which the dataset belongs have this permission by default, and it cannot be changed. This permission can also be granted to other projects or users. | Manageable: the dataset creator and the creator and administrator of the project to which the dataset belongs. Usable: members of the project to which the dataset belongs. |
| Description | The description of the dataset. | This dataset is used for testing |
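
The example paths in this article (/bohr/testdataset-b2dh/v1, /bohr/test1-51ov/v2) suggest a layout of /bohr/<identifier>-<suffix>/v<number>. Assuming that layout holds, which is inferred from the examples rather than stated as a documented format, a script can split a path into its dataset directory and version number, for instance to pick the newest version from a list. The function names below are hypothetical.

```python
import re

# Assumed layout, inferred from the example paths in this article:
#   /bohr/<identifier>-<suffix>/v<number>   (an optional trailing slash is allowed)
_DATASET_PATH = re.compile(r"^/bohr/(?P<dataset>[^/]+)/v(?P<version>\d+)/?$")


def parse_dataset_path(path):
    """Split a dataset version path into (dataset directory name, version number)."""
    match = _DATASET_PATH.match(path)
    if match is None:
        raise ValueError(f"Not a recognized dataset version path: {path}")
    return match.group("dataset"), int(match.group("version"))


def latest_version_path(paths):
    """Return the path with the highest version number among paths of one dataset."""
    parsed = [(parse_dataset_path(p), p) for p in paths]
    datasets = {name for (name, _), _ in parsed}
    if len(datasets) != 1:
        raise ValueError("Expected one or more paths from a single dataset")
    # Compare versions numerically so that v10 sorts after v9.
    return max(parsed, key=lambda item: item[0][1])[1]
```

Because a version's files cannot be changed after creation, a path that includes the version number always refers to the same files, which is why selecting by version number is sufficient here.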