Fixed issue with links not being found #298
Google recently changed the way they present the image data, so the links were no longer being scraped. I figured out how to get the image URLs with the new system and made the appropriate changes so it works again. Unfortunately, Google no longer provides file format data, so I had to try to retrieve it from the URL of the image, which does not work in some cases.
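To illustrate the format fallback described above, here is a minimal sketch of guessing the file format from the image URL. It is not the code from this PR; the function name `guess_format_from_url` and the `KNOWN_FORMATS` set are hypothetical and only show why the approach fails for URLs that carry no extension.

```python
import os
from urllib.parse import urlparse

# Hypothetical whitelist of extensions we are willing to trust.
KNOWN_FORMATS = {"jpg", "jpeg", "png", "gif", "bmp", "svg", "webp", "ico"}


def guess_format_from_url(url):
    """Return the lowercase extension found in the URL path, or None."""
    # Only look at the path so query strings such as ?size=large are ignored.
    path = urlparse(url).path
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return ext if ext in KNOWN_FORMATS else None


print(guess_format_from_url("https://example.com/a/cat.JPG?size=large"))
print(guess_format_from_url("https://example.com/image?id=42"))
```

The second URL has no extension in its path, so the function returns None. That is the "does not work in some cases" situation: the format would have to come from somewhere else, for example the response's Content-Type header.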
Seems like this will only get the first 100 images, correct?
The rest of the images get dynamically loaded through the batchexecute call.
Sorry, I wasn't downloading more than 100, so I didn't think about this. I have not tested whether this works above 100, but my guess is that it will not. However, I do know that even fewer than 100 does not work without these changes. |
cool, well 100 is much better than 0 :)
I get this error every time after about 20 downloaded images. Traceback (most recent call last): |
Hey, Much like MarlonHie, |
I made a quick fix for the NoneType error. I was working on a project using this so I needed it to work again rapidly. Still working only under 100 images though. |
Sorry for not replying faster. The NoneType error happens because every so often an item with a null value for the image data is returned. Fortunately, all of these items are marked with 2 in the data[0] column, so I will just remove them. This should fix the problem. Rian-T's solution also works. |
By filtering out the image objects which had data[0]==2, I have removed the null items and it will no longer give the error: "TypeError: 'NoneType' object is not subscriptable".
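For readers following along, the filter described above can be sketched roughly as follows. The sample data is made up for illustration and does not reproduce Google's actual structure; the only assumption taken from the comment is that null entries carry the value 2 in their first field.

```python
def drop_null_image_objects(image_objects):
    """Keep only entries whose first field is not the null marker 2."""
    return [obj for obj in image_objects if obj[0] != 2]


# Made-up sample: the middle entry mimics a null item marked with 2.
sample = [
    [1, ["https://example.com/a.jpg"]],
    [2, None],
    [1, ["https://example.com/b.png"]],
]

filtered = drop_null_image_objects(sample)
print(len(filtered))
```

Filtering before any indexing means later code never touches the None payload, which is what raised "'NoneType' object is not subscriptable".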
I am still getting these errors with the latest Joeclinton1 version: File "google_images_download.py", line 1017, in |
This system is not very flexible. It seems Google does not keep the target items in the same positions, so sometimes it doesn't work. I added a try-except just in case there are more problems.
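As a rough illustration of the try-except idea, a guarded lookup could look like the sketch below. The helper name `safe_get` and the index paths are hypothetical; the actual positions in Google's response are not given in this thread and are not assumed here.

```python
def safe_get(data, *path):
    """Follow a chain of indexes, returning None if any step fails."""
    current = data
    for index in path:
        try:
            current = current[index]
        except (IndexError, KeyError, TypeError):
            # Position missing, or an intermediate value was None.
            return None
    return current


# Made-up nested structure standing in for one image object.
item = [0, [1, ["https://example.com/a.jpg", 640, 480]]]

print(safe_get(item, 1, 1, 0))
print(safe_get(item, 3, 0))
```

An item that does not match the expected layout is then skipped instead of aborting the whole download run.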
I ran it with 20 queries and some of them return this exception:
|
Hi all, for the time being a probable workaround is to add the Image Downloader extension to your Chrome browser (https://chrome.google.com/webstore/detail/image-downloader/cnpniohnfphhjihaiiggeabnkjhpaldj?hl=en-US). Thanks. |
I believe the solution I have is too inflexible for deployment, as Google does not seem to keep a stable enough structure in the data sent back in the callback. A different solution, perhaps one that collects the non-thumbnail links inside the callback, might work better. |
How do you import this fixed version and run it? |
There isn't a working solution right now. |
I've been trying to get limit > 100 to work. It seems selenium's browser.page_source returns many more newlines than the raw_html you typically get. I've tried stripping the newlines, but with no success. Eventually the code searches for "AF_initDataCallback({key: \'ds:2\'" but that returns -1. If I search for just "AF_initDataCallback" I get a start index, but parsing still ends in a JSONDecodeError. So it seems the entire raw_html from download_extended_page is being parsed incorrectly.
EDIT: Converting the string to a bytearray and back to a string allowed the image_objects to parse correctly. len(image_objects) was only 100, though, so maybe selenium isn't scrolling down far enough? Will keep looking...
EDIT2: The string from download_extended_page is larger, but the object count stays at 100. Comparing a run with a short limit against one with limit > 100, the delta between the start and stop indexes is ~122400 for both raw_html strings after parsing. So no new images seem to be included in the expanded page_source despite it being a larger string. |
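The marker search and JSON parsing discussed above can be sketched like this. It is only an illustration under stated assumptions, not the library's actual parser: the function name `extract_callback_data` is hypothetical, and it assumes the payload is the first JSON array that follows the marker, which may not match the real page.

```python
import json


def extract_callback_data(raw_html, marker="AF_initDataCallback"):
    """Return the first JSON array after the marker, or None on failure."""
    start = raw_html.find(marker)
    if start == -1:
        return None  # marker not present in this page source
    array_start = raw_html.find("[", start)
    if array_start == -1:
        return None
    try:
        # raw_decode stops at the end of the first complete JSON value,
        # so trailing JavaScript after the array is ignored.
        data, _end = json.JSONDecoder().raw_decode(raw_html, array_start)
    except json.JSONDecodeError:
        return None
    return data


# Made-up page fragment for illustration only.
page = "<script>AF_initDataCallback({key: 'ds:2', data:[1, [2, \"a\"]]});</script>"
print(extract_callback_data(page))
```

Letting the decoder find the end of the array avoids computing a stop index by hand, which is one place where extra newlines or a changed wrapper can throw the slicing off.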
Unfortunately, it appears the Google image formatting has been changed; this is a temporary solution from "hardikvasa/google-images-download#298". Change-Id: Iadcfa995e6b7c6229505ec0872810876575d738e
Unfortunately, it appears the Google image formatting has been changed; this is a temporary solution from "hardikvasa/google-images-download#298". Change-Id: Iadcfa995e6b7c6229505ec0872810876575d738e Signed-off-by: goodmeow <[email protected]>
Getting an error with every try, for example: googleimagesdownload --keywords "ty cobb" --limit 10 Python 2.7.17 |
@RetroSeasons pip uninstall google-images-download and then run setup.py again |
I'm now getting this error too. I've run the command multiple times and it always works in the beginning, but then the error appears at random. Sometimes it's after the first 1 or 2 keywords; the most it's gone up to is around 30 keywords before it gives me the error. list index out of range |
I am also getting this error. list index out of range It seems to happen after at least 2 keywords then it fails somewhat randomly at the start of any keyword afterwards. |
I'm getting this error, the same as @Jerick5555. Evaluating... I've tried it in a virtual machine and I'm getting the same error. It's very strange because yesterday I used the program and it worked fine... If anyone comes up with something, let me know. Btw I have Ubuntu 22. Update: I executed the test provided in the project like this: |
@mrclean789 I ran test_google_images_downloads.py and was able to reproduce the error. Thank you for alerting me! The issue is likely caused by Google once again changing the way they format their image object array. |
Great. I hope you find the time. If I knew Python I would try to fix it. Thank you! |
upstream updates
@Joeclinton1 looks like they changed it. I found the issue and am fixing it, I'll raise a PR. Update: Joeclinton1#26 |
It seems that the download list is always empty now since yesterday or the day before. This is using the joe clinton version. It was working for quite some time with some strange periodic problems that would occur for 1/2 a day at a time and then disappear, but since yesterday no search term has downloaded anything for me. Are others finding this? |
fix breaking change due to google's response format
Getting this error. Did anyone find a solution to this? Evaluating... |
Is there a way to encode the returned metadata? I get \u05de\u05d3\u05d5\u05d6\u05d4 \u05d7\u05d5\u05e3 \u05d0\u05e9\u05d3\u05d5\u05d3 instead of Hebrew. I tried adding |
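On the escaped Hebrew: if the metadata is serialized with Python's json module (an assumption, since the thread does not show how it is written out), the \uXXXX sequences are the default ASCII-safe output of json.dumps, and passing ensure_ascii=False keeps the original characters. The key name below is made up for illustration.

```python
import json

# Made-up metadata entry; the value is the first word from the comment above.
metadata = {"image_description": "\u05de\u05d3\u05d5\u05d6\u05d4"}

escaped = json.dumps(metadata)                       # default: ASCII-only output
readable = json.dumps(metadata, ensure_ascii=False)  # keeps the Hebrew characters

print(escaped)
print(readable)
```

Both strings decode to the same data, so the escaped form is not corrupted, only harder to read. When writing the readable form to a file, open it with encoding="utf-8" so the characters are stored correctly.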
Evaluating... Got this error too |
nvm, i updated to the latest version and it is working now. @modikush80 |
just pull from the repo and do the setup again |
Currently, I think Google has changed their JSON again and it no longer works. I am very busy at the moment and have not had a chance to fix it, but on GitHub there are a few PRs that claim to have fixed the problem: https://github.com/Joeclinton1/google-images-download/pulls I will test these at some point, but in the meantime, if you need it to work, you may consider one of their forks. If it works for you, please tell me and I'll just merge their fork. Thank you for your understanding. |
I've tried Joeclinton1#35 and it works for me (using it as part of https://github.com/galantra/FluentForeverVocabBuilder/) |
I am working on a project that depends heavily on this functionality. I refactored it and am maintaining it here https://github.com/ellisbrown/google-images-download/tree/wrapperless if it helps anyone |
Doesn't work for me. What are the correct instructions to use the updated version? I used the following:
I get
|
@copperwiring see my comment above for a working fork. The following worked for me just now:
git clone git@github.com:ellisbrown/google-images-download.git
cd google-images-download
pip install .
python tests/test_google_images_download.py --limit 10 |
Google recently changed the way they present the image data, so the links were no longer being scraped.
I figured out how to get the image URLs with the new system and made the appropriate changes so it works again.
Unfortunately, Google no longer provides file format data, so I had to try to retrieve it from the URL of the image, which does not work in some cases.
EDIT: Since this keeps being asked, here's the code to download the patch for Windows: