Telegram chat of the Scrapy group

2020 August 06

A

Andrii in Scrapy

Andrey Rahmatullin

if you had written this yourself, you wouldn't be asking

It's about 50/50 my own work. I don't fully understand everything yet, but I'm learning

15:40 #1

АК

Александр К-ош... in Scrapy

A Linux update came out today. YouTube in the browser has stopped lagging.

15:41 #2

MH

Mohamed Ali Habib in Scrapy

Hi everyone,

I know this group is in Russian, but I'm asking in case someone happens to speak English.

I'm working on a Scrapy project. My problem is that Scrapy is not scraping all the URLs that it should. For example, I expect an item_scraped_count of around 500, but I only get about 325 (exactly 65%; I tried it on other examples as well).

I have tried a few things like:
CONCURRENT_REQUESTS = 1
RETRY_TIMES = 10 # default 2
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429, 400, 404]
# default is [500, 502, 503, 504, 522, 524, 408, 429]
DOWNLOAD_TIMEOUT = 250 # default is 180

It helped, but not much: I got an item_scraped_count of around 337.

This is my middleware:
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

COOKIES_ENABLED = False
DOWNLOAD_DELAY = 0.2 # I tried higher values like 0.5, but it made no difference

Do you have any ideas on what to try or what could be the problem?

Thanks.

15:52 #3

AR

Andrey Rahmatullin in Scrapy

check how many pages were scraped and with what statuses, and check for duplicates

15:58 #4

AR

Andrey Rahmatullin in Scrapy

many useful things can be seen just by checking the job stats

15:58 #5
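
A minimal sketch of settings that could make the checks suggested above easier, assuming a standard Scrapy project with a `settings.py`. The values and the log file name are illustrative assumptions, not something posted in the chat. Requests dropped by the duplicate filter are also counted under the `dupefilter/filtered` stats key.

```python
# settings.py (sketch; values are illustrative assumptions, not from the chat)

# Log every request dropped by the duplicate filter. By default Scrapy
# logs only the first duplicate it filters.
DUPEFILTER_DEBUG = True

# Keep DEBUG logging on so each crawled URL and its status appears in the log.
LOG_LEVEL = "DEBUG"

# Write the log to a file so it can be searched after the run
# (hypothetical file name).
LOG_FILE = "crawl.log"
```

With these in place, the log can be searched for filtered duplicates and non-200 statuses after the run finishes.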

MH

Mohamed Ali Habib in Scrapy

Thanks for replying @wrar42

If you wouldn't mind looking, here's the stats:

{'downloader/request_bytes': 665823,
'downloader/request_count': 1712,
'downloader/request_method_count/GET': 1712,
'downloader/response_bytes': 49175801,
'downloader/response_count': 1712,
'downloader/response_status_count/200': 1701,
'downloader/response_status_count/404': 11,
'elapsed_time_seconds': 889.642693,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 8, 6, 12, 31, 23, 317411),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/404': 1,
'item_scraped_count': 337,
'log_count/DEBUG': 3761,
'log_count/ERROR': 1,
'log_count/INFO': 26,
'log_count/WARNING': 45,
'memusage/max': 103399424,
'memusage/startup': 56680448,
'request_depth_max': 38,
'response_received_count': 1702,
'retry/count': 10,
'retry/max_reached': 1,
'retry/reason_count/404 Not Found': 10,
'scheduler/dequeued': 1712,
'scheduler/dequeued/memory': 1712,
'scheduler/enqueued': 1712,
'scheduler/enqueued/memory': 1712,
'start_time': datetime.datetime(2020, 8, 6, 12, 16, 33, 674718)}

Only 11 were not found (their responses were 404).
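
A rough sketch of how the stats above could be summarized in plain Python. The `summarize` helper and its output keys are hypothetical, and the gap between 200 responses and items is only a hint about where to look, since not every page is necessarily expected to yield an item. The pasted stats contain no `dupefilter/filtered` key, so the sketch reports 0 for it.

```python
def summarize(stats):
    """Derive a few comparison numbers from a Scrapy stats dict (hypothetical helper)."""
    ok_responses = stats.get("downloader/response_status_count/200", 0)
    items = stats.get("item_scraped_count", 0)
    return {
        "requests": stats.get("downloader/request_count", 0),
        "ok_responses": ok_responses,
        "retries": stats.get("retry/count", 0),
        # Missing key means the duplicate filter dropped nothing.
        "duplicates_filtered": stats.get("dupefilter/filtered", 0),
        "items": items,
        # 200 responses that did not end up as an item.
        "ok_without_item": ok_responses - items,
        "errors_logged": stats.get("log_count/ERROR", 0),
    }


# A subset of the values pasted in the chat.
stats = {
    "downloader/request_count": 1712,
    "downloader/response_status_count/200": 1701,
    "downloader/response_status_count/404": 11,
    "item_scraped_count": 337,
    "log_count/ERROR": 1,
    "retry/count": 10,
}

summary = summarize(stats)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The single logged error and the large number of 200 responses without an item are the two figures worth tracing back to the log.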

К

Use pastebin.com for big code chunks

К

Looks like you got an error

16:12 #8

AR

Andrey Rahmatullin in Scrapy

all 11 404 responses are for one URL; it was just retried 10 times

16:19 #9
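
The arithmetic behind this remark can be sketched as follows, using the numbers from the posted stats and settings. The final settings change is an assumption for illustration, not advice given in the chat: it simply restores the default list quoted earlier, so that 400 and 404 are no longer retried.

```python
# Values posted in the chat.
RETRY_TIMES = 10
retry_count = 10          # 'retry/count'
not_found_responses = 11  # 'downloader/response_status_count/404'

# One original request plus RETRY_TIMES retries accounts for
# every 404 response in the stats, all for a single URL.
attempts_for_one_url = 1 + RETRY_TIMES
assert attempts_for_one_url == not_found_responses
assert retry_count == RETRY_TIMES

# Possible settings change (an assumption): go back to the default
# list so that 400 and 404 responses are not retried.
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```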

AR

Andrey Rahmatullin in Scrapy

I would check which requests you expect to be made and which were actually made

16:19 #10
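
One way to do that comparison, sketched in plain Python under the assumption that both lists of URLs are available (for example, the expected ones from a sitemap or manual count, and the requested ones extracted from the crawl log). The `compare_urls` helper and the example.com URLs are hypothetical.

```python
def compare_urls(expected, requested):
    """Return URLs that were expected but never requested, and the reverse."""
    expected, requested = set(expected), set(requested)
    return {
        "missing": sorted(expected - requested),
        "unexpected": sorted(requested - expected),
    }


# Hypothetical example data.
expected = [
    "https://example.com/item/1",
    "https://example.com/item/2",
    "https://example.com/item/3",
]
requested = [
    "https://example.com/item/1",
    "https://example.com/item/3",
    "https://example.com/list?page=2",
]

report = compare_urls(expected, requested)
print("missing:", report["missing"])
print("unexpected:", report["unexpected"])
```

URLs listed under "missing" were never requested at all, which points at link extraction or pagination rather than at retries or timeouts.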

АК

Александр К-ош... in Scrapy

I wanted to ask: has anyone tried using in their work the various tools that come up in Google for this query:
https://www.google.com/search?as_q=python+scrapy+gui ?

16:47 #11

МС

Михаил Синегубов... in Scrapy

my advice to you: forget about it for now.
get to grips with Scrapy itself first :)

16:54 #12

АК

Александр К-ош... in Scrapy

OK

16:55 #13

S

SoHard 🎄 in Scrapy