Science Gossip Talk

Statistics for text-only filter algorithm improvement?

  • tfmorris

    This blog post from June says that a text-only filter was run over the corpus and that it improved the user experience. Has before/after data been published anywhere? If not, could you publish it please? It being a science project, data is always appreciated.

    I just hand-counted the images on my recent classifications page and found 193 pages with no images (of which 5 were completely blank) and 11 pages with images or tables. That's roughly a 95% misclassification rate (193 of 204 pages) for those pages classified as containing images (it might be a tiny misclassification rate overall, but that's something else the stats would help show).

    Posted

  • tfmorris

    Just got a mailing saying I should help more. Perhaps Victoria & team should consider answering questions on the discussion boards as a way to encourage people's ongoing involvement in the project.

    Posted

  • geoffrey.belknap (scientist)

    Hi tfmorris,

    Thanks for your question. Following on from our blog post, we have done a trial of an image-filtering algorithm, which we applied to the Quarterly Journal of the Geological Society of London. However, we have yet to apply this to the whole upload dataset. We are planning - for our next upload - to roll out the algorithm on a larger scale.

    That being said - there are some downsides to removing the non-image pages. There are still some very interesting aspects of historical citizen science which some of the users are coming across. This page, brought to Talk by @David Goldfarb, for instance, is a great description of the need for community-based research, which we would never have found if we had applied the algorithm to the journal. http://talk.sciencegossip.org/#/subjects/ASC0000piz

    So, while blank pages and non-illustrated pages may sometimes be tedious to classify (especially when a number of them come up in consecutive order), they are still helpful for historians such as myself. We shouldn't think of them as misclassifications, but as opportunities to find other useful and interesting material.

    So far, we have been very lucky to have such a dedicated and helpful group of citizen scientists, and we welcome anyone with questions or concerns, or who wants to help out in any way!

    Geoff

    Posted

  • geoffrey.belknap (scientist)

    Also, I forgot to say - sorry if we missed responding to your message before. We try to keep up with everyone's questions - and the moderators are a fantastic help with that - but sometimes a few questions slip through.

    Posted

  • VVH (scientist, admin) in response to tfmorris's comment.

    Thanks for your posts @tfmorris. I had seen your initial post and discussed it with some of the Science Gossip team, but then our response fell through the net. The reason is, in part, that the answer is a little complicated, but I'll give it a go. Before I do so, please accept my apology for the delay in replying. This is a really good question.

    As Geoff says, we did filter part of the dataset, but not all of it. The reason was that on another Zooniverse project, Snapshot Serengeti, some of my colleagues noticed that people classified at a steadier pace when there were more blanks (i.e. when the dataset was first released) and submitted fewer classifications as the proportion of images with animals increased over time (the blanks being weeded out by volunteers). It's not that volunteers slowed down for each classification (you would expect a 'no' to take less time than a descriptive classification of the 'yes, there is something here' sort). Rather, when there was more to do, people stayed for a shorter period of time, and seemed to enjoy the experience less.

    One of the conclusions drawn by my colleagues, which they've written up in a forthcoming paper that will appear in November of this year (and which I can post as a follow-up!), was that there is a good ratio (project-dependent, perhaps?) of blanks or 'not much to do here' subjects to subjects that are more involved. Removing all of the blank/text-only pages can make projects feel overwhelming or like a chore.

    I may not be representative, but I know that I, for one, enjoy a run of blanks on SG and Snapshot, and then feel a little thrill when I find something to do. I'd be very curious to know what you think the optimum number of blanks would be. I suspect zero! And you're not alone. @Quia, who inspired us to remove the blanks/text-only pages the last time around, felt that blank pages waste people's time.

    So, all of that said, I owe you a proper blog post once that paper is out in late November. In the meantime, I hope to hear from you about how many blanks (if any) you think are OK. If you are interested in discussing the Snapshot results before the paper is published, I can put you in touch with the authors, and depending on what you think of that, maybe we can all collaborate on a piece of research into this area in the future? Apparently the question of blanks is at the cutting edge of crowdsourcing research, and I can't say that we got it right this time in SG. Keep in touch, and thanks again for your interest and help on the project.

    All best,
    Victoria

    Posted

  • yshish (moderator)

    Thank you both for making this clear! Can't wait for the blog post.

    Posted

  • jules (moderator)

    Just out of curiosity, is there an algorithm that can distinguish between blanks and all-text pages? I personally think that blanks are a bit of a waste of time, but pages full of text are worth leaving in - they are sometimes more interesting than the illustrations!
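
    For what it's worth, the blank-versus-text part seems mechanically straightforward. Purely as an illustration (a minimal sketch assuming scanned page images, and not the actual BHL/MICO filter), a genuinely blank scan has almost no dark pixels, while even a sparse text page carries a noticeable amount of "ink":

        # Hypothetical sketch, not the BHL/MICO algorithm: separate genuinely
        # blank scans from text-only pages by measuring dark-pixel density.
        from PIL import Image
        import numpy as np

        def is_blank(path, ink_threshold=0.005):
            """Treat a page as blank if fewer than ~0.5% of its pixels are dark."""
            img = Image.open(path).convert("L")                 # load the scan as greyscale
            pixels = np.asarray(img, dtype=np.float32) / 255.0  # normalise to 0..1
            ink_fraction = float((pixels < 0.5).mean())         # share of dark pixels
            return ink_fraction < ink_threshold                 # text pages score far higher

    Telling dense text apart from engravings is a harder problem (it would need something like connected-component or layout analysis), but dark-pixel density alone should be enough to separate blanks from text.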

    Posted

  • VVH (scientist, admin)

    I'm not sure @jules, but I suspect so. Definitely one to look into. I'll raise it with the SG team at the BHL and see if their algorithm is up to the task.

    Posted

  • alexbfree

    Hi all. @vvh Thanks for the mention of the research I have been doing with MICO!

    Here is the work-in-progress paper on the importance of blanks. This is not citable or officially published yet but people are welcome to read it.
    https://github.com/alexbfree/zooniverse-files/blob/master/papers/work-in-progress/Blanks-WIP-HCOMP-2015.pdf

    If people would like more info, there's background in the blog posts here:

    If anybody has questions, don't hesitate to contact me!

    alex.bowyer@zooniverse.org

    Web Science Architect at the Zooniverse

    Posted

  • jules (moderator)

    Thanks @alexbfree, that's really interesting! I wonder what a similar study would look like for Science Gossip, using "real" blanks versus text-only pages? So much work, so little coffee. 😉

    Posted

  • tfmorris

    Thanks for the replies everyone! I'm glad I wasn't being ignored. 😃

    @geoffrey.belknap - Serendipity is a wonderful thing, but crowdsourcing applications tend, in my experience, to benefit from very careful tuning to the task at hand. Perhaps you would also have discovered that page if you had a crowdsourced OCR-correction task running on the text at http://biodiversitylibrary.org/page/13271447#page/68/mode/1up (Anyone else notice that the "Go to BHL" link is always off by one page?)

    @VVH - Perhaps some people (like me!) are just wired differently. I've classified 500+ Serengeti images and find a full frame of waving grass as annoying as a blank or text-only page here, but I cut them some slack because it's a much more difficult classification problem (plus the trap was triggered by something, even if it was blowing foliage). In glancing through my past SG images, I'd guess I had something like 50% frames with no animals, which is very different from the 95% blanks I found here (plus an empty Serengeti plain is qualitatively more interesting than a page of unread text).

    @alexbfree - Improved user analytics (Geordi) should definitely make it easier to tune the user experience based on data rather than having to guess about black-box behavior. I've got tons of questions about the blanks study, though. Top of the list is how transferable the results are between two dramatically different tasks (PS & SG). Do the results hold even when you scale up to the number of blanks seen here (95%)? Is session length really the important metric? What about the number of good judgements? What happens across sessions? Do the users with a high number of blanks have a really long session, perhaps until they get a single non-blank image, and then never return again? Were any follow-up surveys done with the users to see how they subjectively rated their satisfaction with the task at varying levels of blanks?

    @jules - If some people are interested in reading text-only pages rather than identifying figures, perhaps that could be made a profile setting so people can choose which behavior they want. Personally, if I were interested in reading the text, I'd go to the originals at the Internet Archive (the source that the BHL puts its facade on top of) and read the journals in a linear fashion.

    Thanks again for the answers and pointers!

    Posted

  • tfmorris

    So, it looks like at least part of the reason for the high frequency of blank pages was a garden-variety bug in the back end. I suspect this could also have contributed to the reports of people seeing pages multiple times.

    http://talk.sciencegossip.org/#/boards/BSC0000006/discussions/DSC000041m
    https://github.com/zooniverse/BHL/issues/42

    Pages which were classified as blank by the machine algorithm were not retired, and were presented to users 20 or more times for classification, unsurprisingly resulting in 20/20 votes for no_illustration.
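
    As a rough illustration (hypothetical code, not the actual Zooniverse/BHL back end), this is the kind of check that appears to have been missing: if nothing in the serving logic consults the machine filter's blank flag, those pages keep circulating until they pile up unanimous votes.

        # Hypothetical sketch of the failure mode, not the real Zooniverse/BHL code.
        # If the queue never checks the machine filter's "blank" flag, flagged pages
        # keep circulating and accumulate 20/20 no_illustration votes.
        def next_subjects(queue, retirement_limit=20):  # limit is assumed, for illustration
            """Yield subjects that still need human classifications."""
            for subject in queue:
                if subject.get("machine_blank"):
                    continue  # presumably these should be retired or excluded up front
                if subject.get("classification_count", 0) >= retirement_limit:
                    continue  # already has enough votes to retire
                yield subject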

    Posted

  • yshish (moderator) in response to tfmorris's comment.

    Awesome work on your side, @tfmorris !! Thanks.

    Posted

  • VVH (scientist, admin)

    This is something I think we should try to follow up on when we all get a bit of breathing room. Maybe December? What do you say, @alexbfree?

    Posted