• kromem@lemmy.world
    33 upvotes · 4 months ago

    For everyone predicting how this will corrupt models…

    All the LLMs are already trained on Reddit's data, at least from before 2015 (which is when a dump of the entire site was compiled for research).

    This is only going to be adding recent Reddit data.

    • Stovetop@lemmy.world
      17 upvotes, 1 downvote · 4 months ago

      "This is only going to be adding recent Reddit data."

      A growing amount of which, I would wager, is already the product of LLMs trying to simulate actual content while selling something. It's going to corrupt itself over time unless they figure out how to filter LLM-generated content out of the input.

      • kromem@lemmy.world
        7 upvotes · edited · 4 months ago

        Not really. Model collapse is a potential issue when training on only synthetic data, but the same research on model collapse found that a mix of organic and synthetic data performed better than either alone. Additionally, for cost reasons, that research used weaker models than those typically used today, and separate research has shown that you can enhance models significantly using synthetic data from SotA models.

        The actual impact on future models will be minimal, and at least a bit of a mixture is probably even a good thing for future training, given the research to date.
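
For anyone who wants to see the collapse-versus-mixing intuition in miniature, here is a toy sketch. It is not the setup from the research mentioned above: the "model" is just a Gaussian refit each generation, and the function name `simulate`, the sample size, and the 50/50 mixing ratio are all assumptions made for illustration. It only shows that a share of fresh real data keeps the fitted spread from shrinking toward zero; it does not reproduce the finding that a mix beats either source alone.

```python
import random
import statistics


def simulate(generations, n_samples, real_fraction, seed=0):
    """Toy loop: refit a Gaussian each generation; return the fitted std per generation.

    real_fraction is the share of each generation's training set drawn from
    the true N(0, 1) distribution; the rest is sampled from the previous fit.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real data exactly
    history = [sigma]
    n_real = round(n_samples * real_fraction)
    for _ in range(generations):
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_real)]
        data = real + synthetic
        # "Training" here is only a maximum-likelihood Gaussian fit.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append(sigma)
    return history


synthetic_only = simulate(500, 20, real_fraction=0.0)
mixed = simulate(500, 20, real_fraction=0.5)
print(f"synthetic only: final std = {synthetic_only[-1]:.3g}")
print(f"50/50 mix:      final std = {mixed[-1]:.3g}")
```

With only synthetic data, each refit slightly underestimates the spread and the errors compound, so the fitted std drifts toward zero. With half real data per generation, the fit stays anchored near the true distribution.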