I spent a week ranking the top 30 npm packages by downloads, and first place is a surprise!


As front-end developers we use npm every day, but have you ever wondered, as I did, which package is downloaded the most? How many times per day? How many GitHub stars do those packages have? Last week I stumbled across the library glob, which is downloaded a staggering 80 million times per week, while react gets only 15 million. Is glob the highest, and if not, who is number one?

The results

A week of work later, here is the answer: supports-color! Total downloads: 26,108,633,482, yet it has only 319 GitHub stars. I also built a website with charts for several time windows: last week, last month, last year, and all-time downloads. It isn't optimized yet, so it may load slowly: https://www.npmrank.net/

Pics or it didn't happen, so here are two screenshots of the site: [screenshot]

[screenshot]

Analyzing the npm website's API to get a package's download counts

Inspecting the requests in the browser devtools shows that the same URL, e.g. https://www.npmjs.com/package/lodash , behaves differently: clicking through from the Popular libraries section on the npm homepage hits an endpoint that returns JSON, while typing the URL into the address bar returns server-rendered HTML. Several rounds of toggling headers one at a time failed to pin down which header makes the difference, so I set that question aside (no, I did not just go to sleep):

1. Find the request that returns JSON, then Copy -> Copy as fetch. [screenshot]
2. Paste it into the console. [screenshot]
3. Copy the headers into Postman; the download counts are visible in the response. [screenshot]
4. Open Postman's code panel and pick the Python snippet. [screenshot]
5. Paste it into test.py and delete the empty headers. [screenshot]

OK, that gives us an endpoint for a single package. From it we can read the GitHub URL, the package versions, the weekly download counts for the past year, and more.
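To illustrate, here is a minimal sketch of pulling those fields out of the page JSON. The payload below is fabricated; the field names (`packageVersion`, `packument`, `versions[i]['date']['ts']`) follow the ones the scraper code in this article relies on.

```python
import datetime

def parse_npm_page_json(npm_data):
    """Extract the repository URL, latest version, and created/updated dates
    from the JSON returned by the npm package page.
    Field names follow the scraper code used later in this article."""
    def ts_to_str(ts_ms):
        # npm timestamps are milliseconds since the epoch
        return datetime.datetime.fromtimestamp(ts_ms / 1000).strftime("%Y-%m-%d %H:%M:%S")

    versions = npm_data['packument'].get('versions') or []
    return {
        'github_url': npm_data['packageVersion'].get('repository', ''),
        'version': npm_data['packument'].get('version', ''),
        # versions[0] is the newest release, versions[-1] the oldest
        'updated': ts_to_str(versions[0]['date']['ts']),
        'created': ts_to_str(versions[-1]['date']['ts']),
    }

# A fabricated payload shaped like the fields accessed above
sample = {
    'packageVersion': {'repository': 'https://github.com/lodash/lodash'},
    'packument': {
        'version': '4.17.21',
        'versions': [
            {'date': {'ts': 1613746000000}},  # newest
            {'date': {'ts': 1338500000000}},  # oldest
        ],
    },
}
info = parse_npm_page_json(sample)
```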

Getting downloads for arbitrary time ranges via npm's official API

The website endpoint above only covers the weekly downloads of the last year. Is there anything for other time ranges? After some digging I found that npm provides exactly such an endpoint (official API docs). [screenshot] With it we can get downloads for last week, last month, or any date range. Two caveats: the official API returns at most 18 months of data per request, and the earliest data is from 2015-01-10, so to compute the all-time total you have to fetch the downloads year by year and sum them. (If you want to know how many times your own package has been installed, this works too.) What remains is collecting a long list of package names and looping over them.
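The year-by-year segmentation described above can be sketched standalone. The `start:end` range format below matches npm's point downloads API, and the 2015 floor comes from the earliest data npm keeps:

```python
import datetime

def yearly_ranges(first_year=2015, last_year=None):
    """Build the start:end date ranges to feed npm's point downloads API,
    one per year, since each request covers at most 18 months."""
    if last_year is None:
        last_year = datetime.datetime.now().year
    return [f'{year}-01-01:{year + 1}-01-01'
            for year in range(first_year, last_year + 1)]

ranges = yearly_ranges(2015, 2017)
```

Summing the downloads fetched for each range gives the all-time total, which is what the scraper's step 4 does.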

Getting the 2019 ranking

Searching the web for "npm download rank", the only result matching what I wanted was the statistics anvaka compiled in 2019. He downloaded the entire npm registry and analyzed it from several angles; this markdown file lists the top 1000 most-depended-upon packages. Since being depended on more generally means being downloaded more, the error should be small. [screenshot]

1. Save the file locally as SOURCE_FILE.
2. Extract the package names and URLs and store them in a SQLite database:

```python
''' Read the package names from the markdown file and store them in the database '''
with open(SOURCE_FILE, 'r') as f:
    lines = f.readlines()
    for line in lines:
        name = re.findall(r'\[(.*?)\]', line)
        href = re.findall(r'\((.*?)\)', line)
        print('line\n', line)
        if name and href:
            get_pkgbase_query = '''SELECT * FROM pkgbase WHERE id = ?'''
            record_base = sql_obj.get(get_pkgbase_query, (name[0],), one=True)
            if record_base is None:
                insert_data_query = '''
                    INSERT INTO pkgbase
                    ('id', 'npm_url', 'github_url', 'homepage_url', 'version',
                     'license', 'github_star', 'size', created, updated)
                    VALUES(?,?,?,?,?,?,?,?,?,?)
                    '''
                sql_obj.update(insert_data_query,
                               (name[0], NPM_BASE_URL + name[0],
                                '', '', '', '', 0, '', 0, 0))
```

[screenshot]
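The two regexes in the import script can be exercised in isolation; this standalone sketch pairs up names and links from a markdown line (the sample line is made up):

```python
import re

def extract_md_links(line):
    """Extract (name, href) pairs from markdown links like '[lodash](https://...)'.
    Same two regexes as the import script above."""
    names = re.findall(r'\[(.*?)\]', line)
    hrefs = re.findall(r'\((.*?)\)', line)
    return list(zip(names, hrefs))

links = extract_md_links('1. [lodash](https://www.npmjs.com/package/lodash) - utility belt')
```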

3. Loop over the packages and store their metadata (the original loop had no `except` clause visible; a minimal one is restored here so the retry loop is valid Python):

```python
''' Update the package metadata '''
async def main():
    all_data_query = '''SELECT * FROM pkgbase'''
    records = sql_obj.get(all_data_query)
    for index, record in enumerate(records):
        while True:
            print('id', record['id'], index)
            try:
                ''' Fetch the package page JSON and write it to the database '''
                href = NPM_BASE_URL + record['id']
                npm_response = requests.request("GET", href, headers=npm_headers)
                npm_data = npm_response.json()

                # pkgbase
                github_url = npm_data['packageVersion'].get('repository', '')
                homepage_url = npm_data['packageVersion'].get('homepage', '')
                version = npm_data['packument'].get('version', '')
                license = npm_data['packument'].get('license', '')
                # some repositories declare two licenses (a non-string value)
                license = license if type(license) == str else '-'
                versions = npm_data['packument'].get('versions') if npm_data['packument'].get('versions') else []
                if versions:
                    updated = datetime.datetime.fromtimestamp(versions[0]['date']['ts'] / 1000).strftime("%Y-%m-%d %H:%M:%S")
                    created = datetime.datetime.fromtimestamp(versions[-1]['date']['ts'] / 1000).strftime("%Y-%m-%d %H:%M:%S")
                else:
                    updated = created = ''

                update_pkgbase_query = '''
                    UPDATE pkgbase
                    SET github_url = ?, homepage_url = ?, version = ?, license = ?, updated = ?, created = ?
                    WHERE id = ?
                    '''
                sql_obj.update(update_pkgbase_query, (github_url, homepage_url, version, license, updated, created, record['id']))
                break
            except Exception as error:
                print('retrying', record['id'], error)
```
4. Update the download counts for each time window:

```python
''' Get the downloads for one time range '''
def get_point_downloads(date_range, package_name):
    href = f'{NPM_BASE_API_POINT_URL}{date_range}/{package_name}'
    response = requests.request("GET", href)
    data = response.json()
    return data['downloads']

''' Get the all-time downloads. npm returns at most 18 months of data
    per request, so fetch year by year and sum. '''
def get_point_all_downloads(package_name):
    start_time = 2015
    end_time = datetime.datetime.now().year
    all_downloads = 0

    for year in range(start_time, end_time + 1):
        dltype = f'{year}'
        date_range = f'{year}-01-01:{year + 1}-01-01'
        print('date_range', date_range)

        downloads = get_point_downloads(date_range, package_name)
        all_downloads += downloads
        print('new downloads', downloads)
        add_data_query = '''
            INSERT INTO pkgdownload
            ('id', 'dltype', 'downloads', 'timepoint')
            VALUES(?,?,?,?)
            '''
        sql_obj.update(add_data_query, (package_name, dltype, downloads, datetime.datetime.now()))

    return all_downloads

...

# pkgdownload
base_dltype = ['last-day', 'last-week', 'last-month', 'last-year', 'all-time']
for dltype in base_dltype:
    if dltype == 'all-time':
        downloads = get_point_all_downloads(record['id'])
    else:
        downloads = get_point_downloads(dltype, record['id'])
    print('dltype', dltype)
    print('downloads', downloads)
    replaced_dltype = re.sub(r'-', '_', dltype)
    add_pkgdownload_query = '''
        INSERT INTO pkgdownload
        ('id', 'dltype', 'downloads', 'timepoint')
        VALUES(?,?,?,?)
        '''
    sql_obj.update(add_pkgdownload_query, (record['id'], replaced_dltype, downloads, datetime.datetime.now()))
```
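The `while True` loop in step 3 retries a failed request forever. A bounded retry with exponential backoff is friendlier to npm's servers; this is a hypothetical helper, not part of the original scraper:

```python
import time

def with_retries(fetch, max_attempts=5, base_delay=1.0):
    """Call fetch(), retrying up to max_attempts times with
    exponentially growing pauses (1s, 2s, 4s, ...)."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Demo: a fetch that fails twice, then succeeds
attempts = {'n': 0}
def flaky_fetch():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise ConnectionError('transient failure')
    return {'downloads': 42}

result = with_retries(flaky_fetch, base_delay=0.01)
```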

Getting each package's GitHub data

The website endpoint actually returns a ghapi field, e.g. https://api.github.com/repos/lodash/lodash , whose stargazers_count field is the star count. But that API is rate-limited to 60 requests per hour, so I had no choice but to scrape:

```python
def set_github_info(github_url, package_name):
    response = requests.get(github_url, headers=github_headers)
    soup = BeautifulSoup(response.content, "html.parser")
    star = soup.find("span", class_='text-bold').get_text() if soup.find("span", class_='text-bold') else 0
    update_pkgbase_query = '''
        UPDATE pkgbase
        SET github_star = ?
        WHERE id = ?
        '''
    print('package_name star', package_name, star)
    sql_obj.update(update_pkgbase_query, (star, package_name))
```

This was my first time using the BeautifulSoup module from bs4; getting the star count takes only two lines of code, which is remarkably convenient.
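One detail the scraper glosses over: GitHub abbreviates large star counts on the page (e.g. "59.3k"), so the scraped text is not always a plain integer. A hypothetical normalizer (not in the original code) could be applied before storing:

```python
def parse_star_text(text):
    """Convert GitHub's rendered star text ('319', '1,024', '59.3k') to an int."""
    text = str(text).strip().lower().replace(',', '')
    if text.endswith('k'):
        return int(round(float(text[:-1]) * 1000))
    return int(text)
```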

I just discovered that another endpoint also returns the star count, e.g. the stargazers_count inside the response of https://api.github.com/repos/lodash/lodash/pulls?per_page=1 ; I'll swap it in when I have time.

Starting the service

After all of the above we now have two tables, pkgbase and pkgdownload, which look like this: [screenshot] [screenshot]

Next, two endpoints. The first returns the available ranking types (last week, last year, all-time, and so on) for the front end's filter. I spin up a simple server with quart:

```python
from quart import Quart, request
import re

from db import SQLDB

app = Quart(__name__)
sql_obj = SQLDB()

''' Get the ranking types '''
@app.route('/api/ranking/types')
async def get_types():
    return {
        'code': 200,
        'data': get_rank_types(),
        'success': True
    }

def get_rank_types():
    get_types_query = 'SELECT DISTINCT dltype FROM pkgdownload'
    records = list(map(convert_type, sql_obj.get(get_types_query)))

    return records

def convert_type(record):
    dltype = re.sub(r'_', '-', record['dltype'])
    return {
        'label': dltype,
        'value': dltype
    }

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=8080)
```

The second endpoint returns the ranking data for a given ranking type: [screenshot]

```python
''' Get the package ranking data '''
@app.route('/api/ranking/packages/<rank_type>')
async def get_packages(rank_type):
    top = request.args.get('top')
    if top is None:
        top = 30
    elif int(top) > 200:
        top = 200
    else:
        top = int(top)
    rank_types = get_rank_types()
    rank_type = next((c['value'] for c in rank_types if c['value'] == rank_type), None)

    if rank_type:
        rank_type = re.sub(r'-', '_', rank_type)
        get_data_query = '''
            SELECT a.id, npm_url npmUrl, github_url githubUrl, homepage_url homepageUrl,
                   dltype dltype, downloads downloads, github_star githubStar,
                   version, license, updated, created
            FROM (
                SELECT id, dltype, downloads
                FROM pkgdownload
                WHERE dltype = ?
                ORDER BY downloads DESC
                LIMIT 0, ?
            ) a, pkgbase b
            WHERE a.id = b.id
            '''
        records = sql_obj.get(get_data_query, (rank_type, top))

        for index, record in enumerate(records):
            records[index]['rank'] = index + 1

        return {
            'code': 200,
            'data': records,
            'success': True
        }
```

[screenshot]
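The nested ranking query can be exercised without the real database; here is a runnable in-memory sketch with fabricated rows (the star counts and download numbers are sample values only):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row
conn.executescript('''
    CREATE TABLE pkgbase (id TEXT PRIMARY KEY, github_star INTEGER);
    CREATE TABLE pkgdownload (id TEXT, dltype TEXT, downloads INTEGER);
    INSERT INTO pkgbase VALUES ('supports-color', 319), ('lodash', 59345);
    INSERT INTO pkgdownload VALUES
        ('supports-color', 'last_week', 80000000),
        ('lodash', 'last_week', 40000000),
        ('lodash', 'last_day', 6000000);
''')

# Same shape as the endpoint's query: rank one dltype, then join the metadata
rows = conn.execute('''
    SELECT a.id, a.downloads, b.github_star
    FROM (SELECT id, downloads FROM pkgdownload
          WHERE dltype = ? ORDER BY downloads DESC LIMIT ?) a
    JOIN pkgbase b ON a.id = b.id
    ORDER BY a.downloads DESC
''', ('last_week', 30)).fetchall()
ranking = [(r['id'], r['downloads']) for r in rows]
```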

Bonus

If you read the server code above, you may have noticed that the ranking endpoint takes a top parameter, capped at 200. The charts can't comfortably display that many entries, but if you want to see the top 200 for yourself you can tweak the request, e.g. https://www.npmrank.net/api/ranking/packages/last-day?top=200 . For rankings beyond 200, open the pkgdownload table in database.db.

Wrapping up

That's the whole pipeline for building the npm ranking. If you found it interesting, a like or a star would be appreciated. Back-end repo: npmrank. Live site: https://www.npmrank.net/