The Smack Dataset
The Smack Dataset does not exist yet. If it is ever built, it will be a libre build of The Stack dataset, assembled without using the original dataset directly because of its non-libre (non-“open source”) license encumbrances.
Note
Parrot is in early development, not ready for end users.
The Stack Metadata
The Stack has a separate metadata repository containing information about the dataset without hosting the dataset itself. This practice is beneficial as it allows researchers to understand dataset contents without being bound by licenses. For instance, how can one agree to a license when they’re unaware of the content’s licenses? By using metadata files, this issue can be mitigated.
Link to the Git Repository:
git clone https://huggingface.co/datasets/bigcode/the-stack-metadata
Downloading Metadata
The metadata is considerably smaller than the full dataset, but it is still very large: the Git metadata repository is approximately one terabyte in size.
Reading Metadata
The Stack’s metadata is stored in parquet format. The parquet files span 562 gigabytes and consist of 2,832 individual files across 945 directories.
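As a rough illustration of reading the column names, the sketch below finds the parquet files in a local clone and reads only the schema of each file, not its rows. It assumes the `pyarrow` package is installed; the function names are hypothetical and are not taken from the project's own scripts.

```python
from pathlib import Path


def find_parquet_files(root):
    """Return every *.parquet file under root, sorted for stable output."""
    return sorted(Path(root).rglob("*.parquet"))


def read_headers(parquet_path):
    """Return the column names of one parquet file without loading its rows."""
    # pyarrow is imported here so the file search works without it installed.
    import pyarrow.parquet as pq

    return pq.read_schema(str(parquet_path)).names
```

Calling `find_parquet_files` on the directory where the metadata repository was cloned, and then `read_headers` on each result, lists the headers file by file. Reading only the schema keeps memory use low, which matters at this data size.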
Selecting Repos
Write a script to filter appropriate repositories based on libre criteria.
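One way such a filter might look is sketched below. The allowlist of SPDX identifiers and the record layout (a dict with hypothetical `repo_name` and `licenses` keys) are assumptions for illustration; the actual metadata columns and the project's libre criteria may differ.

```python
# Example allowlist of SPDX identifiers; an assumption, not the project's list.
LIBRE_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause", "GPL-3.0-only"}


def is_libre(licenses, allowed=LIBRE_LICENSES):
    """True only if at least one license is listed and every one is allowed."""
    return bool(licenses) and all(lic in allowed for lic in licenses)


def select_repos(records, allowed=LIBRE_LICENSES):
    """Return the names of repositories whose licenses all pass the allowlist."""
    return [r["repo_name"] for r in records if is_libre(r.get("licenses"), allowed)]
```

In this sketch, repositories with no recorded license are rejected, and so are repositories that mix an allowed license with one outside the allowlist. That is a deliberately conservative choice.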
Cloning Repos
Write a script to clone the selected repositories.
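A minimal sketch of such a script follows. It assumes the selected repositories are given as `owner/name` strings hosted on GitHub and that `git` is available on the PATH; the function names and the `repos` destination directory are hypothetical.

```python
import subprocess
from pathlib import Path


def clone_command(repo_name, dest_root="repos"):
    """Build the git command for one repo; the GitHub URL base is an assumption."""
    url = f"https://github.com/{repo_name}.git"
    dest = Path(dest_root) / repo_name
    # --depth 1 fetches only the latest commit to save disk space and time.
    return ["git", "clone", "--depth", "1", url, str(dest)]


def clone_repos(repo_names, dest_root="repos"):
    """Clone each repo, skipping ones already present; return names that failed."""
    failed = []
    for name in repo_names:
        if (Path(dest_root) / name).exists():
            continue  # already cloned on an earlier run
        result = subprocess.run(clone_command(name, dest_root))
        if result.returncode != 0:
            failed.append(name)
    return failed
```

Returning the list of failures rather than stopping at the first error lets a long run continue past deleted or renamed repositories, which can then be retried or dropped.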
Train
Use libre code from BigCode (creators of The Stack) for model training.
Scripts
The following scripts are available:
the-stack-headers
– Retrieves header names from The Stack’s parquet files.
the-stack-licenses
– Extracts licenses and records from The Stack’s license file.
Code Assist
The following scripts were developed using Parrot code assist:
the-stack-headers
the-stack-licenses
These scripts were created with the Phind-CodeLlama-34B-v2_q8.gguf model from TheBloke.
Note
Parrot documentation is written in English and uses AI machine translation for other languages.