• percent@infosec.pub
    3 days ago

    There are huge public datasets that are often used for pretraining. Common Crawl and C4 are probably the most prominent, but there are others.

    There are also big public datasets available for fine-tuning and instruction tuning.

    The open-weight models are getting pretty powerful, thanks to some Chinese labs.