Python Data Mining Study Notes (12): Taobao Image Crawler in Practice

An image crawler (图片爬虫) is a crawler program that automatically downloads images hosted on other servers across the Internet.

I. Page link analysis before writing the crawler

1. Open the Taobao homepage and enter a keyword in the search box, such as "Shenzhou". In the search results, click through to the next page so that you have opened the first, second, and third pages of results, and copy the URL of each page into Notepad, as follows:

2. Compare the URLs of the pages. Rather than dwelling on the parts that differ from one URL to the next, look for the parts that every URL has in common.

(1) Each URL contains an "s=XXX" part, which appears to encode the page number: 0 for the first page, 44 for the second page, and 88 for the third page. By extension, 132 should represent the fourth page. Changing "s=0" to "s=132" in the URL of the first page does indeed jump to the fourth page.
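The offset rule inferred above can be written as a small helper. This is a minimal sketch: the function names are hypothetical, and the page size of 44 is only what the three observed URLs suggest, not a documented Taobao constant.

```python
PAGE_SIZE = 44  # assumption: inferred from s=0, 44, 88 on pages 1 to 3


def page_to_offset(page, page_size=PAGE_SIZE):
    """Return the value of the "s" parameter for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * page_size


def offset_to_page(offset, page_size=PAGE_SIZE):
    """Return the 1-based page number that an "s" value points to."""
    return offset // page_size + 1


if __name__ == "__main__":
    for page in range(1, 5):
        print(page, page_to_offset(page))
```

Calling page_to_offset(4) gives 132, the value used in the manual test above.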

(2) Although the keyword is not readable in the copied URL, the browser address bar clearly shows "q=XXX" with the keyword you typed. This suggests that Chinese characters are URL-encoded (percent-encoded) when the browser actually requests the page.
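The encoding guessed at here is easy to reproduce with the standard library. The sketch below assumes the keyword is the Chinese word 神舟 (the text only gives "Shenzhou") and that the site expects UTF-8, which is what the "ie=utf8" parameter in the search URL hints at.

```python
from urllib.parse import quote, unquote

keyword = "神舟"  # assumption: an example Chinese keyword

# quote() turns each UTF-8 byte of a non-ASCII character into %XX.
encoded = quote(keyword)
# unquote() reverses the process.
decoded = unquote(encoded)

print(encoded)
print(decoded == keyword)
```

An ASCII keyword such as "Shenzhou" passes through quote() unchanged, which is why the encoding step is only visible with Chinese input.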

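3. From this, the structure of the page link the image crawler needs can be sketched out by modifying the URL of any results page: keep the rest of the URL as it is, put the URL-encoded keyword after "q=", and set "s=" to 44 times the zero-based page index.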
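One way to "modify the URL of any page" programmatically is to parse its query string, overwrite the two parameters of interest, and reassemble it. This is only a sketch: make_page_url is a hypothetical helper, and the base URL in the demo is a shortened stand-in for a URL copied from the browser.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def make_page_url(base_url, keyword, page_index, page_size=44):
    """Copy base_url, replacing q with the keyword and s with the offset.

    page_index is zero-based, so index 0 is the first results page.
    """
    parts = urlsplit(base_url)
    # keep_blank_values preserves empty parameters such as "imgfile=".
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params["q"] = keyword                      # urlencode percent-encodes it
    params["s"] = str(page_index * page_size)  # result offset
    return urlunsplit(parts._replace(query=urlencode(params)))


if __name__ == "__main__":
    # Shortened stand-in for a URL copied from the browser.
    base = "https://s.taobao.com/search?q=old&imgfile=&ie=utf8&s=0"
    for i in range(3):
        print(make_page_url(base, "神舟", i))
```

Plain string concatenation works just as well for a fixed URL; parsing the query is mainly useful when the copied URL changes between sessions.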

II. Image link analysis before writing the crawler

1. Right-click an image on the Taobao results page, choose Copy Image Address, paste the address into Notepad, and analyze it:

3. Looking at the URL, the first half is the location of the image resource on the server, and the second half is the image name and its format. In particular, "250X250" is the resolution of the picture: to save resources, the Taobao search page serves thumbnails.
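The same split can be done in code. The sketch below is illustrative only: describe_image_url is a hypothetical helper, the sample address is made up, and the exact way Taobao appends the size to the file name is an assumption, so the regular expression simply looks for any "widthXheight" pair in the name.

```python
import posixpath
import re
from urllib.parse import urlsplit


def describe_image_url(url):
    """Split an image URL into server location, file name, and thumbnail size."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    # Look for a size such as 250x250 or 250X250 inside the file name.
    size = re.search(r"(\d+)[xX](\d+)", filename)
    return {
        "location": parts.netloc + directory,
        "filename": filename,
        "size": (int(size.group(1)), int(size.group(2))) if size else None,
    }


if __name__ == "__main__":
    # Hypothetical address, not a real Taobao image.
    sample = "https://img.example.com/uploaded/i1/photo.jpg_250x250.jpg"
    print(describe_image_url(sample))
```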

4. Search the page source for the core part of the image URL, "TB2ISTydyCYBuNkSnaVXXcMsVXa" in this example:

Copy the link found in the source and open it, and the full-size HD picture appears:

5. Look at the text immediately before and after the image link in the source: the link is preceded by pic_url":" and terminated by a closing ". This example is fairly simple, so the image links can be extracted from the page source directly, without any packet capture.
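The delimiters described above translate directly into a lazy regular expression. The sketch below runs it against a made-up fragment shaped like that text; the sample string and the helper name extract_image_links are assumptions, and real page source will contain far more around each match.

```python
import re

# Lazy match: capture everything after pic_url":"// up to the next quote.
PIC_URL_PATTERN = re.compile(r'pic_url":"//(.*?)"')


def extract_image_links(page_source):
    """Return the image addresses (without the leading //) found in the page."""
    return PIC_URL_PATTERN.findall(page_source)


if __name__ == "__main__":
    # Hypothetical fragment, not real Taobao page source.
    sample = (
        '{"title":"a","pic_url":"//img.example.com/a.jpg","price":"1"},'
        '{"title":"b","pic_url":"//img.example.com/b.jpg","price":"2"}'
    )
    print(extract_image_links(sample))
```

The non-greedy (.*?) matters here: a greedy (.*) would run to the last quote on the line and merge several links into one match.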

III. Writing the image crawler program

import os
import re
import urllib.request

keyname = "Shenzhou"
key = urllib.request.quote(keyname)  # URL-encode the keyword
os.makedirs("F:/taobaoIMG", exist_ok=True)  # make sure the output folder exists
# Try to crawl the first three pages
for i in range(0, 3):
    # Build the page URL; "s" is the result offset (44 items per page)
    url = "https://s.taobao.com/search?q=" + key + "&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20180915&ie=utf8&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=" + str(i * 44)
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    pat = 'pic_url":"//(.*?)"'
    # Extract the image URLs from the page source
    imagelist = re.compile(pat).findall(data)
    for j in range(0, len(imagelist)):
        thisimg = imagelist[j]
        # Build the full image URL
        thisimgurl = "http://" + thisimg
        # File name is the page index followed by the image index
        file = "F:/taobaoIMG/" + str(i) + str(j) + ".jpg"
        urllib.request.urlretrieve(thisimgurl, filename=file)
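In the program above, a single failed download stops the whole crawl. As a possible refinement, not part of the original notes, the download step can be wrapped so that a bad link is skipped instead. download_image is a hypothetical helper built only on urllib.request.urlretrieve and os.makedirs.

```python
import os
import urllib.request


def download_image(url, dest_path):
    """Download url to dest_path; return True on success, False on failure."""
    folder = os.path.dirname(dest_path)
    if folder:
        # Create the target folder if it does not exist yet.
        os.makedirs(folder, exist_ok=True)
    try:
        urllib.request.urlretrieve(url, filename=dest_path)
    except (OSError, ValueError) as exc:
        # URLError and HTTPError are OSError subclasses; ValueError covers
        # malformed URLs such as a missing scheme.
        print("skipped", url, "->", exc)
        return False
    return True
```

Inside the inner loop, the direct urlretrieve call could then become download_image(thisimgurl, file), and the return value can be used to count how many images were actually saved.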

Known problem: some of the downloaded images may match the keyword poorly. In this example, results such as Shenzhou spacecraft models and Shenzhou Bird electric vehicles can turn up. Results of this kind may be caused by anti-crawling measures on the Taobao website; pointers from more experienced readers are welcome.

Thanks to teacher Wei Wei for the guidance.