Cleaning I need data :)

Dysnomia

Registered User
初心者/ Shoshinsha / Beginner
Joined
Jun 9, 2019
Messages
1
Reaction score
0
Age
29
Country
France
Hi everyone, let me introduce myself: I'm Jonathan, an IT developer (Java/Python) from France.

I really love reading manga, and I know cleaning is something that really slows down the translation process,
so I would be really happy if I could help you translate faster ;)

That's why I'm trying to create an open-source (free to use/modify) AI for scan cleaning in my free time.
The problem is that training an AI requires a large amount of data (the source scan and the cleaned one with no text).

So here is my request: do you know a way to find this data? :)
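
To make the request concrete: the training data would be pairs of matching pages, one raw and one cleaned. A minimal sketch of how such pairs might be collected, assuming (hypothetically) that raws and cleans sit in two separate folders and that matching pages share the same file name apart from the extension. `pair_pages` and the folder layout are illustrative only, not part of any existing tool:

```python
from pathlib import Path

# Assumption: these are the image formats the pages are stored in.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def pair_pages(raw_dir, clean_dir):
    """Return (raw_path, clean_path) pairs whose file stems match."""
    raw_dir, clean_dir = Path(raw_dir), Path(clean_dir)
    # Index cleaned pages by stem so "012.jpg" can match "012.png".
    cleans = {
        p.stem: p
        for p in clean_dir.iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    }
    pairs = []
    for raw in sorted(raw_dir.iterdir()):
        # Skip non-image files and raws that have no cleaned counterpart.
        if raw.suffix.lower() in IMAGE_SUFFIXES and raw.stem in cleans:
            pairs.append((raw, cleans[raw.stem]))
    return pairs
```

Pages without a counterpart are simply skipped, so a partially cleaned chapter still yields usable pairs.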
 

Gradonil_Ral

Scanlator
中級員 / Chuukyuuin / Member
Joined
Jun 4, 2008
Messages
161
Reaction score
59
Gender
Male
Country
Winterfell
TBH, I don't exactly see that happening.

There are three types of raws:
1. Magazine.
2. Tankoubon.
3. Webraws.

Quality of magazine and tank raws may differ based on the type of paper or ink that's been used for the print. Another factor is the type of scanner that's been used: flatbed photo scanners usually give the best results, while document scanners give much worse ones.
Usually, raw providers will scan two pages at a time to cut the scanning time in half. Some will separate the pages themselves, some will leave it to cleaners.
Either way, raws first need to be rotated and cropped. Only then will the proper cleaning procedure begin.
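
As a rough illustration of the page-separation step, here is a sketch that computes crop boxes for a two-page scan. It assumes the scan is already straight and that the gutter sits exactly in the middle, which real scans rarely guarantee; `split_spread` is a hypothetical helper. The boxes use the (left, upper, right, lower) order that Pillow's `Image.crop` expects:

```python
def split_spread(width, height, right_to_left=True):
    """Return crop boxes (left, upper, right, lower) for the two pages of a spread."""
    # Assumption: the gutter is at the exact horizontal centre of the scan.
    middle = width // 2
    left_page = (0, 0, middle, height)
    right_page = (middle, 0, width, height)
    # Japanese manga reads right to left, so the right-hand page comes first.
    return [right_page, left_page] if right_to_left else [left_page, right_page]
```

For a 2000x1400 scan this returns the right-hand half first, then the left-hand half. Rotation and fine cropping would still have to be done before or after this step.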

You might have trouble finding large quantities of rotated/cropped raws out there. Your best bet would be webraws, but they only require text removal, so your AI wouldn't learn much else.

In any case, you could try contacting some older scanlation groups - even the inactive ones. If you manage to find their old admins, they might be willing to share their old files with you. Current groups might be a problem - local scanlations tend not to credit the original scanlators, which doesn't make those groups eager to give out their files for free.
The better groups should even have their PSDs with a raw layer beneath the clean one - which would be the perfect thing for you, I guess.
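
One reason such layered files would suit training: the raw and the clean are pixel-aligned, so comparing them shows exactly where text was removed. A minimal sketch of that comparison, assuming both layers have already been exported as same-sized grayscale images and loaded as nested lists of 0-255 values. `changed_mask` and the threshold of 16 are illustrative assumptions:

```python
def changed_mask(raw, clean, threshold=16):
    """Mark pixels whose grayscale value differs by more than `threshold`."""
    if len(raw) != len(clean) or any(len(r) != len(c) for r, c in zip(raw, clean)):
        raise ValueError("raw and clean must have the same dimensions")
    # True where the cleaner changed the page (e.g. removed text),
    # False where the pixel was left essentially untouched.
    return [
        [abs(r - c) > threshold for r, c in zip(raw_row, clean_row)]
        for raw_row, clean_row in zip(raw, clean)
    ]
```

The threshold is there so that small differences from levelling or compression are not counted as removed text; a suitable value would have to be found by experiment.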
 