Although this is characterized as 1.0, it is governed by the Terms of Use for Early-Access, which are quite limiting, including: "You may use the Service solely for noncommercial purposes."
It really is rather peculiar to me. They frame it like this (emphasis mine):
"With the preliminary publication of this dataset, we further seek to establish a community-led process to grow, improve, and use institutional data in ways that strengthen the knowledge ecosystem and assert the importance of ongoing stewardship of training data from the originating knowledge institutions themselves. To this end, we are experimenting to find the best way to release this data in a manner that facilitates collaboration. We encourage input on this process to guide the full publication of this and future dataset releases, beginning with the following decisions:
* At preliminary launch, we have published the metadata, including experimental metadata, in full for anyone to access and use.
* At preliminary launch, we have published the dataset including OCR-extracted text under a noncommercial license, and with a 'click-through' that requires users to accept this license, additional terms of use, and to share basic contact information with us so that we can engage the community in its early use.
* At preliminary launch, we have chosen to postpone the release of the raw scan images, though we will share them liberally with researchers and libraries who wish to review them. While we know AI developers and researchers are eager for more raw materials, we believe this minor friction can help build the relationships and norms necessary to grow a collaborative community."
It is the fruit of their labour (well, the digitisation is), so it is up to them to license it as they see fit. But it feels odd to me that they want this degree of control. In open source and in my own research field, the pattern we tend to follow is to release freely, observe, and then build relationships, rather than holding a "license gun" to the head of potential collaborators.
Lastly, I have only skimmed the pre-print, but I noted no commitment to a final license either, nor even a direction for one. So, as a natural language processing researcher, I will steer clear of this dataset for the time being and hope the licensing situation improves.
"Noncommercial" seems pernicious to me lately. I can see why people reach for it, but it really is hard to define (there are many ways to profit from something without simply selling it directly).
This has always been my problem with CC-NC: it's just not clear to me what counts as "commercial" and what doesn't.
Can't sell the item itself? Okay, makes sense.
What about a downstream manufactured item? Say a CC-NC STL that you have since 3D-printed: you can't sell the STL, but what about the printed object? If it must be not-for-profit, must you necessarily take a loss, or could you sell the items at cost?
Or offering a CC-NC item for free in the same place you sell other products for profit, where the CC-NC item acts as a "loss leader" to draw customers to your commercial offerings?
Or giving everything away, the CC-NC item and everything else, but while representing a commercial entity that does so for marketing purposes, with the end goal of generating more revenue for the business?
I much prefer GPL/CC-SA licenses; they're much clearer about where the line sits with regard to usage.
Don't most of these licenses also cover "derived works"? The trivial cases: you get an STL and print the object, it's clearly derived; you get some code and edit parts of it for a new application, it's clearly derived.
Personally, I feel it's also fairly clear that an AI model is a derived work, but... there is so much money involved that people take the risk (e.g. early Spotify and its sourcing of music) and hope it becomes a non-issue.
HOWEVER, since China and co. are going to wholesale ignore IP/copyright to train AI models, the choice we have... may not be much of a choice at all.
>>Can't sell the item itself? Okay, makes sense.
IANAL, but I think it goes even one step beyond that, which is that the item and derived works can't even be used to support a commercial enterprise, even if the (derived) work isn't being sold or seen by the outside public.
Interesting; if true, that effectively means the answer to all my questions would be "no".
That's exactly why it's problematic in licenses unless explicitly defined.
I'm sure Harvard doesn't consider its use as commercial, even though some people there get big salaries. Claudine Gay, for instance, makes more than $1m/year even after losing the job of President in the scandal. There are only a few "commercial" businesses that pay that well.
These are all public domain books, which they don't have the rights to relicense like this.
https://huggingface.co/datasets/institutional/institutional-... https://huggingface.co/datasets/institutional/institutional-...
https://github.com/instdin/institutional-books-1-pipeline https://www.institutionaldatainitiative.org/institutional-bo...
The AI's lizard brain will be 60% 1800s, apparently; it might act like a villainous steampunk Anglo-Saxon twirling a mustache in moments of survival, or at least some blend of those values while playing 5D chess. Read it H. G. Wells's "World Brain" to calm it down, like a fond childhood memory.
This would be the funniest possible future, and a very distinct possibility depending on how the NYT lawsuit turns out in regards to IP holder rights versus AI "copyright laundering".
https://lifearchitect.ai/datasets-table/
Edit: Two responses, https://news.ycombinator.com/item?id=44252450 and https://news.ycombinator.com/item?id=44252408, seem to be dupes. As rickydroll states, the timestamps and ID numbers show this one to be the first.
It's a copy of mine. Look at the timestamps.
I meant it as a joke - to blatantly steal your comment since you said copyright is evil.
So you were cosplaying an LLM trained on my comments :-)
Good one.
Seems a strange comparison - I don't think anyone claims "search engines" should be a repository of cultural memory.
Nobody intended "search engines" to be a repository of cultural memory. They became that because they were built on content and information that encompassed cultural memory, and people used them for that purpose. How many times have you told someone to Google something instead of giving them a URL?
Training sets are currently built on the same information, and now chatbots are a different way to query for that information. So, in the same way as with search engines, chatbots have become another repository of cultural memory.
At some time in the future, people will come to believe that if it's not in a search engine or a chatbot, it doesn't exist, which to me is why it's vital to put everything we know into a training set in addition to archiving it someplace that will survive a Carrington-level blast from the sun.
IMO, making multiple copies of archives of everything we know supersedes copyright.
The training sets should be public then
Yes they should