
Proposal: Source Information

Posted: 02 Dec 2013, 00:17
by Tracy Poff
We've had some discussion in one place or another about how to deal with storing source information (or maybe it would be better to call it 'provenance') for our data. I've a proposal for a general way to handle this information, which I think may be suitable. I will mix a few implementation details into my description, for which I apologize in advance.

The main thing I suggest is that a new type of object (and consequently a new table in the database) called, for example, SourceInformation be created. Each other type of object for which we require source information can link to one of these. This object type will have several properties:
Code:
Screenshots      [Links to Screenshot entities]
Box Scans        [Links to box scan entities]
Uploaded Images  [Links to generic image entities]
Text             [Freeform description of the sources]
Source Tags      [Links to tags--more info later]
The first three fields hold (lists of) links to other objects in the database. For example, if credits are sourced to screenshots, or if publisher info is sourced to a box scan, those fields will contain links to the relevant entities. The third field, in particular, will hold any uploaded source images that don't fit into the other categories, be they low-quality images not otherwise suitable for inclusion, screenshots of web pages, or whatever else.

The 'text' field holds a textual description of the source of the information. It should describe what the attached images show and where they come from (e.g. "The credits are sourced to the endgame staff roll, of which stills are provided."), along with anything else needed to explain the source. If, for example, there's reason to doubt that a source is accurate, text can be included here explaining why there is some doubt.

The 'source tags' field links to some tag list containing relevant descriptive information. For example, "Low Quality Screenshots" would indicate that we might wish to replace our source screenshots with better ones, and "Release Date from Retailer" or "Uncertain Release Date" could help us find dates that might need more verification. "Source Links to Dead Website" could be useful as well. Of course, there could be many such tags that we would find useful. The primary purpose of this field is to help us find entries whose data has low-quality sources of one kind or another, by tagging them in a way that we can easily search for.
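
As a rough illustration of the kind of search the tags are meant to enable, here is a minimal Python sketch. The entity names, the dictionary layout, and the function `entities_with_tag` are all made up for the example; only the tag names come from the description above.

```python
from typing import Dict, List, Set

# Hypothetical sample data: entity name -> source tags on its source record.
TAGS_BY_ENTITY: Dict[str, Set[str]] = {
    "Game A": {"Low Quality Screenshots"},
    "Game B": {"Release Date from Retailer", "Source Links to Dead Website"},
    "Game C": {"Uncertain Release Date", "Low Quality Screenshots"},
}


def entities_with_tag(tags_by_entity: Dict[str, Set[str]], tag: str) -> List[str]:
    """Return the names of all entities whose sources carry the given tag."""
    return sorted(name for name, tags in tags_by_entity.items() if tag in tags)


# Find everything whose source screenshots we might want to replace.
needs_better_shots = entities_with_tag(TAGS_BY_ENTITY, "Low Quality Screenshots")
print(needs_better_shots)
```

In a real database this would presumably be a join against a tag table rather than a dictionary scan, but the idea is the same.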

These SourceInformation objects would be linked to, as I said, from any entities in the database which would require source information. By creating dedicated objects in the database to hold this information, we won't have to duplicate all these columns on every table, since essentially the same sorts of information will need to be stored whether the sources apply to a Release Group or a Game.
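
To make the shape of this concrete, here is a minimal Python sketch of the idea, not a definitive schema. The field names mirror the list above; the `Game` and `ReleaseGroup` classes, the integer ids, and the sample values are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceInformation:
    """Hypothetical source record; fields follow the proposal's list."""
    screenshots: List[int] = field(default_factory=list)      # ids of Screenshot entities
    box_scans: List[int] = field(default_factory=list)        # ids of box scan entities
    uploaded_images: List[int] = field(default_factory=list)  # ids of generic image entities
    text: str = ""                                            # freeform description
    source_tags: List[str] = field(default_factory=list)      # e.g. "Uncertain Release Date"


@dataclass
class Game:
    title: str
    source: Optional[SourceInformation] = None  # the single extra "column"


@dataclass
class ReleaseGroup:
    name: str
    source: Optional[SourceInformation] = None  # same link, different entity type


credits_source = SourceInformation(
    screenshots=[101, 102],  # made-up ids
    text="The credits are sourced to the endgame staff roll, of which stills are provided.",
    source_tags=["Low Quality Screenshots"],
)
game = Game(title="Example Game", source=credits_source)
print(game.source.text)
```

Both entity types carry one optional link to the same kind of record, so none of the source fields are repeated on `Game` or `ReleaseGroup` themselves.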

Importantly, this proposal does not allow for individual properties to have individual source information separated in the database, since I am proposing a single extra column per table (it would of course be possible to add one source column for each piece of data, but I think that would be overkill). It also does not provide for a database of potential sources (i.e. there will be no object called 'the playstation store' that release dates can be sourced to).

The exact structure of the SourceInformation table that I've described above is, of course, just a preliminary example. Other fields may be needed: if Oregami stores information about magazine issues, we may well want to allow a magazine issue to be linked to as a source, for example.

The only really important thing I'm suggesting is: the descriptions of our sources should get a table in the database and be linked to when needed.

Given all these caveats, what do you think?

Re: Proposal: Source Information

Posted: 02 Dec 2013, 00:58
by jotaroraido
I like it. A generic approach to this information means that the same format can be used for any type of contribution, keeping its effect on system complexity to a minimum.

I assume sources wouldn't be linked directly to data, but rather linked through submissions, correct? For example, a release date -- contributor A submits a game and gives only the copyright year as the release date; contributor B comes along and fills in the month from a press release, but isn't sure of the correct day; then contributor C scans their receipt from picking up their pre-order on the launch day, finally giving the correct date. The history for the data shows all three contributions -- newest first -- which each link to the source information described here for each submission.
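
A minimal Python sketch of how I picture that history, under my own assumptions: the `Submission` class, the string-based partial dates, and every date below are invented for the example, and `source_text` stands in for a link to a full source record.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Submission:
    contributor: str
    submitted_on: date   # when the contribution was made (invented dates below)
    release_date: str    # partial dates allowed, e.g. "2013" or "2013-11"
    source_text: str     # stands in for a link to a source record


def history_newest_first(submissions: List[Submission]) -> List[Submission]:
    """Return the contribution history ordered from newest to oldest."""
    return sorted(submissions, key=lambda s: s.submitted_on, reverse=True)


def current_value(submissions: List[Submission]) -> str:
    """The value shown is the one from the most recent submission."""
    return history_newest_first(submissions)[0].release_date


submissions = [
    Submission("A", date(2013, 12, 1), "2013", "Copyright year on the title screen."),
    Submission("B", date(2013, 12, 5), "2013-11", "Month taken from a press release."),
    Submission("C", date(2013, 12, 9), "2013-11-15", "Scanned pre-order receipt from launch day."),
]
for s in history_newest_first(submissions):
    print(s.contributor, s.release_date, "-", s.source_text)
```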

Is this right, or am I making it more complicated than it needs to be?

EDIT: Sorry, just saw your post in the other thread. Looks like I'm on the right track here. :)

Re: Proposal: Source Information

Posted: 02 Dec 2013, 01:36
by Tracy Poff
jotaroraido wrote:EDIT: Sorry, just saw your post in the other thread. Looks like I'm on the right track here. :)
Yep, that's about right. Of note is that since the SourceInfo things will be versioned like everything else, we'll always have the old source information available, and the current version will reflect the current, best sources for the current information.

Re: Proposal: Source Information

Posted: 02 Dec 2013, 14:56
by MZ per X
First of all: thanks for the write-up. :)
Tracy Poff wrote:The only really important thing I'm suggesting is: the descriptions of our sources should get a table in the database and be linked to when needed.
That's absolutely the way to go, so I fully agree with your proposal on a technical level. If we add a link to a URL table, and a link to page entities for everything that has pages (gaming press, manuals, books), maybe also a property for non-image files, we should have the basics in place.
Tracy Poff wrote:(i.e. there will be no object called 'the playstation store' that release dates can be sourced to).
This could be done using source tags, too, I suppose. Or we could create a separate table for recurring sources, which would also come in handy if we identify problems with one of those recurring sources and need to make mass adjustments.

The main process issue I have, as described in the other thread, is still that I'd like to see our provenance become as self-contained, and independent of other parts of the database, as possible. We have two cases here:

1) A user contributes some (textual) data, and uploads screenshots / box scans / page scans as a source.

a) The images uploaded do not meet our quality criteria for screenshots / scans. That one should be easy using source tags. If the uploaded images are tagged as screenshots, for example, we can include them somehow within our screenshot pages, kept separate from the quality shots.

b) The images uploaded meet our quality criteria for screenshots / scans. Then we should give the user the possibility to contribute the necessary additional information for screenshots / cover scans right away. If he/she doesn't, we should tag the image appropriately for other users to jump in.

2) A user contributes screenshots / scans which hold valuable source information that should be re-used.

a) We could link to the images within the sourcing of other contributions. This is what I don't like very much, as we would need to maintain our provenance later on, when said screenshots/scans disappear or change for whatever reason.

b) We could re-save the shots/scans in question as a source image. This would create overhead and duplication, no doubt, but could be lightened by automatically scaling down the images, or merging multiple credits screenshots into one.

What do you think? Did I miss a contribution case?

Re: Proposal: Source Information

Posted: 03 Dec 2013, 05:27
by Tracy Poff
MZ per X wrote:That's absolutely the way to go, so I fully agree with your proposal on a technical level. If we add a link to a URL table, and a link to page entities for everything that has pages (gaming press, manuals, books), maybe also a property for non-image files, we should have the basics in place.
Yeah, I know that there will be a few other things to add, which will have to wait until we've fleshed out other parts of the database a bit more. We'll certainly revisit this all, later.
MZ per X wrote:The main process issue I have, as described in the other thread, is still that I'd like to see our provenance become as self-contained, and independent of other parts of the database, as possible. We have two cases here:

1) A user contributes some (textual) data, and uploads screenshots / box scans / page scans as a source.

a) The images uploaded do not meet our quality criteria for screenshots / scans. That one should be easy using source tags. If the uploaded images are tagged as screenshots, for example, we can include them somehow within our screenshot pages, kept separate from the quality shots.

b) The images uploaded meet our quality criteria for screenshots / scans. Then we should give the user the possibility to contribute the necessary additional information for screenshots / cover scans right away. If he/she doesn't, we should tag the image appropriately for other users to jump in.
I suppose the simplest possible solution, here, would be to have a section that just displays every piece of media uploaded as a source for a game (or for a release, creditset, whatever--somehow walk down the tree and collect everything). No need to worry about whether it's a screenshot or not, or whatever. This would probably be a little-used feature, but it would at least do something to make those images more available.
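
Walking down the tree could look roughly like the following Python sketch. The `Node` class, its fields, and the sample filenames are hypothetical; the real entities will certainly look different.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical entity (game, release, credit set, ...) in the tree."""
    name: str
    source_media: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def collect_source_media(node: Node) -> Iterator[str]:
    """Depth-first walk yielding every media item used as a source."""
    yield from node.source_media
    for child in node.children:
        yield from collect_source_media(child)


credit_set = Node("Credit set", source_media=["staff_roll_1.png", "staff_roll_2.png"])
release = Node("EU release", source_media=["box_back.jpg"], children=[credit_set])
game = Node("Example Game", children=[release])

all_media = list(collect_source_media(game))
print(all_media)
```

The walk makes no distinction between screenshots, scans, and other uploads, which is the point: everything used as a source ends up in one list.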

Then, as you say, ideally artifacts of sufficient quality will eventually be included also in the regular sections. We can play the long game, here, since after all the database will never be complete, anyway.
MZ per X wrote:2) A user contributes screenshots / scans which hold valuable source information that should be re-used.

a) We could link to the images within the sourcing of other contributions. This is what I don't like very much, as we would need to maintain our provenance later on, when said screenshots/scans disappear or change for whatever reason.
We could just commit to never actually deleting an image for any reason. Unlink the ones we don't like from the games, but never delete them. I'll go into the cost of this a little more, below.

The only cases where we would want to actually delete something would be if:
  1. The image doesn't actually have anything to do with us--someone accidentally uploads their vacation photos, or whatever.
  2. We're legally forced to remove the image.
In the first case, those images would never be used as sources anyway, so we have no problem. The second case is harder. If we do link to images from sources as in my proposal, we could very easily run a query for every SourceInfo thing that uses them, so we could at least know what would be affected, and take steps to fix the problem before deleting the images. And we could, as I think I mentioned somewhere, have some logic in the program that forbids deleting images that are used as sources, so we couldn't do it by accident.
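
Here is a minimal sketch of both ideas using Python's built-in `sqlite3` module. The table and column names (`image`, `source_info`, `source_info_image`) and the sample rows are assumptions for illustration only, not a proposed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE image (id INTEGER PRIMARY KEY, filename TEXT NOT NULL);
    CREATE TABLE source_info (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
    -- Link table: which source records use which images.
    CREATE TABLE source_info_image (
        source_info_id INTEGER NOT NULL REFERENCES source_info(id),
        image_id       INTEGER NOT NULL REFERENCES image(id)
    );
""")
conn.executemany(
    "INSERT INTO image VALUES (?, ?)",
    [(1, "staff_roll_1.png"), (2, "vacation.jpg")],
)
conn.execute("INSERT INTO source_info VALUES (10, 'Credits sourced to the staff roll.')")
conn.execute("INSERT INTO source_info_image VALUES (10, 1)")


def sources_using_image(conn: sqlite3.Connection, image_id: int) -> list:
    """Return the ids of every source record that links to the image."""
    rows = conn.execute(
        "SELECT source_info_id FROM source_info_image "
        "WHERE image_id = ? ORDER BY source_info_id",
        (image_id,),
    ).fetchall()
    return [row[0] for row in rows]


def delete_image(conn: sqlite3.Connection, image_id: int) -> None:
    """Delete an image, refusing if any source record still uses it."""
    used_by = sources_using_image(conn, image_id)
    if used_by:
        raise ValueError(f"image {image_id} is still used by sources {used_by}")
    conn.execute("DELETE FROM image WHERE id = ?", (image_id,))


print(sources_using_image(conn, 1))
```

With this in place, the unrelated vacation photo (image 2) can be deleted freely, while deleting image 1 raises an error until the source record that uses it has been fixed.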
MZ per X wrote:b) We could re-save the shots/scans in question as a source image. This would create overhead and duplication, no doubt, but could be lightened by automatically scaling down the images, or merging multiple credits screenshots into one.
I'd strongly advise against ever scaling down or otherwise reducing the quality of images, and I don't think merging would be a benefit. There are two kinds of cost to duplication:
  1. Maintenance work. We have to deal with contributions that revise their descriptions, or whatever.
  2. Files take up space.
You seem mostly concerned with the second problem, so I'll say only briefly that I don't think the maintenance overhead from duplication would be very great, and then look at the second issue in more detail.

Let's take a couple of cases as examples.

On MobyGames, there are about 16,000 NES screenshots. The biggest NES screenshot I have on hand (in PNG format, so lossless) is about 20 KB, while most are smaller than 10 KB. Let's vastly overestimate and imagine that the average size of an NES screenshot is 50 KB. Then the whole set of NES screenshots on MobyGames comes in at under a gigabyte (16,000 × 50 KB = 800 MB), and even if we stored three copies of every image, it'd comfortably fit on a single-layer DVD. So, not a problem.

Most older systems will be similarly small, I'd guess. But what about Windows games? They're much higher resolution, and there are lots of them. Let's again take MobyGames as an example.

There are about 150,000 Windows screenshots on MobyGames. Let's again ludicrously overestimate and say that the average size of a Windows screenshot is 5 megabytes (for comparison, my full-resolution screenshots of Fallout 3 are about a tenth this size). Then the full set of Windows screenshots on MobyGames would be about 750 gigabytes. Much bigger than the NES set, but still a totally manageable size.
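
For anyone who wants to check the arithmetic, here it is as a short Python sketch. The counts and average sizes are the deliberate overestimates given above; decimal units (1 MB = 1,000 KB) and the 4,700 MB capacity of a single-layer DVD are the only things I've added.

```python
def total_mb(count: int, avg_kb: int) -> int:
    """Total size in decimal megabytes (1 MB = 1,000 KB)."""
    return count * avg_kb // 1000


SINGLE_LAYER_DVD_MB = 4_700  # nominal capacity of a single-layer DVD

nes_mb = total_mb(16_000, 50)          # 16,000 shots at 50 KB each
windows_mb = total_mb(150_000, 5_000)  # 150,000 shots at 5 MB each
nes_three_copies_mb = 3 * nes_mb

print(f"NES: {nes_mb} MB, three copies: {nes_three_copies_mb} MB")
print(f"Windows: {windows_mb // 1000} GB")
print("Three NES copies fit on a DVD:", nes_three_copies_mb < SINGLE_LAYER_DVD_MB)
```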

Of course, since the only duplication we're talking about is a small number of source shots per game, I sincerely doubt that storing source screenshots will ever cause any difficulty related to the size of our data set. And if we were worried, we could use a single-instance storage system to mitigate duplicate file concerns completely.
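
By single-instance storage I mean something along these lines: files are stored under a hash of their contents, so uploading the same file twice costs nothing extra. This is only a toy in-memory sketch in Python; the class name and the choice of SHA-256 are my own assumptions, and a real system would write to disk or object storage.

```python
import hashlib
from typing import Dict


class SingleInstanceStore:
    """Toy in-memory store that keeps one copy of each distinct file body."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store the data (if new) and return its content-derived key."""
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)  # identical content is stored once
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


store = SingleInstanceStore()
key_as_screenshot = store.put(b"fake image bytes")  # uploaded as a regular screenshot
key_as_source = store.put(b"fake image bytes")      # same file re-saved as a source image
store.put(b"a different image")
print(len(store))
```

Both uploads of the same file get the same key, so the screenshot section and the source record can each hold their own reference without the file being stored twice.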
MZ per X wrote:What do you think? Did I miss a contribution case?
I think you've hit everything. In case I meandered too far from my point above: though I think having high-quality and complete sets of scans and screenshots that we could refer to as sources is a great goal, I do also think that it's perfectly reasonable to retain every source image submitted indefinitely. You shouldn't trouble yourself over the size of our data set. If it should ever grow big enough to merit special consideration, we'll already be successful enough that we should be able to deal with it.

Re: Proposal: Source Information

Posted: 03 Dec 2013, 22:46
by MZ per X
Tracy Poff wrote:I suppose the simplest possible solution, here, would be to have a section that just displays every piece of media uploaded as a source for a game (or for a release, creditset, whatever--somehow walk down the tree and collect everything). No need to worry about whether it's a screenshot or not, or whatever. This would probably be a little-used feature, but it would at least do something to make those images more available.
Yep, agreed. :)
Tracy Poff wrote:The second case is harder. If we do link to images from sources as in my proposal, we could very easily run a query for every SourceInfo thing that uses them, so we could at least know what would be affected, and take steps to fix the problem before deleting the images. And we could, as I think I mentioned somewhere, have some logic in the program that forbids deleting images that are used as sources, so we couldn't do it by accident.
I imagine a way to quickly hide single images, or a branch of them, built into the database. Thinking about it, that seems like an important feature to me, which should be included from the start. By hiding I actually mean replacing the images with a placeholder picture, and maybe adding a text explaining the reason for the hiding, too.

The provenance could be left unaffected by that, or could be hidden separately in severe cases.
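
A minimal Python sketch of what I have in mind, with everything in it assumed for illustration: the `Image` class, the `hidden_reason` field, and the placeholder filename are not part of any existing design.

```python
from dataclasses import dataclass
from typing import Optional

PLACEHOLDER = "placeholder.png"  # hypothetical placeholder picture


@dataclass
class Image:
    filename: str
    hidden_reason: Optional[str] = None  # None means the image is visible

    def hide(self, reason: str) -> None:
        """Hide the image without deleting it; the file itself is kept."""
        self.hidden_reason = reason

    def displayed_filename(self) -> str:
        """What the site shows: the real file, or the placeholder if hidden."""
        return PLACEHOLDER if self.hidden_reason is not None else self.filename


shot = Image("staff_roll_1.png")
shot.hide("Removed from display pending a rights question.")
print(shot.displayed_filename(), "-", shot.hidden_reason)
```

Since the original filename is kept on the record, any provenance that points at the image stays intact while the picture itself is not shown.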
Tracy Poff wrote:There are about 150,000 Windows screenshots on MobyGames. Let's again ludicrously overestimate and say that the average size of a Windows screenshot is 5 megabytes (for comparison, my full-resolution screenshots of Fallout 3 are about a tenth this size). Then the full set of Windows screenshots on MobyGames would be about 750 gigabytes. Much bigger than the NES set, but still a totally manageable size.
Thanks for doing the math. :) I agree that this would be manageable.
Tracy Poff wrote:I think you've hit everything. In case I meandered too far from my point above: though I think having high-quality and complete sets of scans and screenshots that we could refer to as sources is a great goal, I do also think that it's perfectly reasonable to retain every source image submitted indefinitely.
Okay, so with a hiding facility in place like the one I described above, I agree that both ways are possible for case #2. That basically means that I can leave the decision about this issue up to the developers, depending on which solution is easier for them to implement. 8)
Tracy Poff wrote:You shouldn't trouble yourself over the size of our data set. If it should ever grow big enough to merit special consideration, we'll already be successful enough that we should be able to deal with it.
:D Yeah, now that would be a luxury problem.