AI Algorithm Engineer Handbook, Software Engineering at Google, Nationwide University Quality-of-Life Roundup | ShowMeAI Daily News #2022.07.02


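The ShowMeAI Daily series has been fully upgraded! It covers AI across Tools & Frameworks | Projects & Code | Blog Posts & Shares | Data & Resources | Research & Papers. See the list of past articles, and subscribe to the topic #ShowMeAI资讯日报 in the official account to receive each day's latest issue. See Topic Collections & Monthly E-zine to browse the full set for each topic.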

1. Tools & Frameworks

Tool: Jupyter Splitview - before/after image split view for Jupyter

'Jupyter Splitview - Making before/after image sliders in JupyterLab' by Jan-Hendrik Müller

GitHub: https://github.com/kolibril13/jupyter-splitview

Tool platform: Spring Cloud Tencent - Tencent's open-source one-stop microservice solution, implementing the standard Spring Cloud microservice SPI

'Spring Cloud Tencent - Spring Cloud Tencent is a Spring Boot based Service Governance Framework provided by Tencent, including service discovery, traffic control, circuitbreak, ratelimit, config and so on.' by Tencent

GitHub: https://github.com/Tencent/spring-cloud-tencent

Library: cuNumeric - a NumPy replacement that scales across multiple nodes and multiple GPUs

'cuNumeric - An Aspiring Drop-In Replacement for NumPy at Scale' by Legate

GitHub: https://github.com/nv-legate/cunumeric

Library: gymnax - reinforcement learning environments in JAX

'Reinforcement Learning Environments in JAX' by Robert Lange

GitHub: https://github.com/RobertTLange/gymnax

Framework: Flax - a high-performance JAX neural network library and ecosystem focused on flexibility

'Flax: A neural network library and ecosystem for JAX designed for flexibility - Flax is a neural network ecosystem for JAX that is designed for flexibility.' by Google

GitHub: https://github.com/google/flax

2. Blog Posts & Shares

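E-book: AI Algorithm Engineer Handbook

The author, Hua Xiaozhuan, was formerly a senior algorithm engineer at Alibaba and chief algorithm researcher at Zhiyi Technology (智易科技), is now a senior researcher at Tencent, and wrote the book 《Python 大战机器学习》. The e-book covers mathematical foundations, statistical learning, deep learning, and related tools.

Link: http://www.huaxiaozhuan.com/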

E-book: Google's software engineering guide

'Chinese translation of Software Engineering at Google' by qiangmzsx

GitHub: https://github.com/qiangmzsx/Software-Engineering-at-Google

Website: https://qiangmzsx.github.io/Software-Engineering-at-Google

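3. Data & Resources

Share: nationwide roundup of university quality-of-life information

Quality of life at some universities - a collection of the rules and details that universities across China do not spell out during admissions but that materially affect students' daily life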

GitHub: https://github.com/CollegesChat/university-information

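4. Research & Papers

Reply with the keyword 日报 ("daily") in the official account's message box to get the curated paper collection for free.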

Paper: I^2R-Net: Intra- and Inter-Human Relation Network for Multi-Person Pose Estimation

Title: I^2R-Net: Intra- and Inter-Human Relation Network for Multi-Person Pose Estimation

Published: 22 Jun 2022

Field: Computer Vision

Tasks: Multi-Person Pose Estimation, Pose Estimation

Paper link: https://arxiv.org/abs/2206.10892

Code: https://github.com/leijue222/Intra-and-Inter-Human-Relation-Network-for-MPEE

Authors: Yiwei Ding, Wenjin Deng, Yinglin Zheng, PengFei Liu, Meihong Wang, Xuan Cheng, Jianmin Bao, Dong Chen, Ming Zeng

Summary: In this paper, we present the Intra- and Inter-Human Relation Networks (I^2R-Net) for Multi-Person Pose Estimation.

Abstract: In this paper, we present the Intra- and Inter-Human Relation Networks (I^2R-Net) for Multi-Person Pose Estimation. It involves two basic modules. First, the Intra-Human Relation Module operates on a single person and aims to capture Intra-Human dependencies. Second, the Inter-Human Relation Module considers the relation between multiple instances and focuses on capturing Inter-Human interactions. The Inter-Human Relation Module can be designed very lightweight by reducing the resolution of feature map, yet learn useful relation information to significantly boost the performance of the Intra-Human Relation Module. Even without bells and whistles, our method can compete or outperform current competition winners. We conduct extensive experiments on COCO, CrowdPose, and OCHuman datasets. The results demonstrate that the proposed model surpasses all the state-of-the-art methods. Concretely, the proposed method achieves 77.4% AP on CrowdPose dataset and 67.8% AP on OCHuman dataset respectively, outperforming existing methods by a large margin. Additionally, the ablation study and visualization analysis also prove the effectiveness of our model.

Paper: ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers

Title: ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers

Published: 20 Apr 2022

Field: Speech

Tasks: Disentanglement, Self-Supervised Learning

Paper link: https://arxiv.org/abs/2204.09224

Code: https://github.com/auspicious3000/contentvec

Authors: Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark Hasegawa-Johnson, Shiyu Chang

Summary: Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks.

Abstract: Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks. Since the majority of the downstream tasks of SSL learning in speech largely focus on the content information in speech, the most desirable speech representations should be able to disentangle unwanted variations, such as speaker variations, from the content. However, disentangling speakers is very challenging, because removing the speaker information could easily result in a loss of content as well, and the damage of the latter usually far outweighs the benefit of the former. In this paper, we propose a new SSL method that can achieve speaker disentanglement without severe loss of content. Our approach is adapted from the HuBERT framework, and incorporates disentangling mechanisms to regularize both the teacher labels and the learned representations. We evaluate the benefit of speaker disentanglement on a set of content-related downstream tasks, and observe a consistent and notable performance advantage of our speaker-disentangled representations.

Paper: Rethinking Bayesian Deep Learning Methods for Semi-Supervised Volumetric Medical Image Segmentation

Title: Rethinking Bayesian Deep Learning Methods for Semi-Supervised Volumetric Medical Image Segmentation

Published: CVPR 2022

Field: Medical AI

Tasks: Medical Image Segmentation, Semantic Segmentation, Semi-supervised Medical Image Segmentation, Volumetric Medical Image Segmentation

Paper link: https://arxiv.org/abs/2206.09293

Code: https://github.com/jianf-wang/gbdl

Authors: JianFeng Wang, Thomas Lukasiewicz

Summary: Existing Bayesian deep learning methods for semi-supervised medical image segmentation are only partially based on Bayesian deep learning, as their overall architectures are not designed under the Bayesian framework.

Abstract: Recently, several Bayesian deep learning methods have been proposed for semi-supervised medical image segmentation. Although they have achieved promising results on medical benchmarks, some problems are still existing. Firstly, their overall architectures belong to the discriminative models, and hence, in the early stage of training, they only use labeled data for training, which might make them overfit to the labeled data. Secondly, in fact, they are only partially based on Bayesian deep learning, as their overall architectures are not designed under the Bayesian framework. However, unifying the overall architecture under the Bayesian perspective can make the architecture have a rigorous theoretical basis, so that each part of the architecture can have a clear probabilistic interpretation. Therefore, to solve the problems, we propose a new generative Bayesian deep learning (GBDL) architecture. GBDL belongs to the generative models, whose target is to estimate the joint distribution of input medical volumes and their corresponding labels. Estimating the joint distribution implicitly involves the distribution of data, so both labeled and unlabeled data can be utilized in the early stage of training, which alleviates the potential overfitting problem. Besides, GBDL is completely designed under the Bayesian framework, and thus we give its full Bayesian formulation, which lays a theoretical probabilistic foundation for our architecture. Extensive experiments show that our GBDL outperforms previous state-of-the-art methods in terms of four commonly used evaluation indicators on three public medical datasets.

Paper: Set Norm and Equivariant Skip Connections: Putting the Deep in Deep Sets

Title: Set Norm and Equivariant Skip Connections: Putting the Deep in Deep Sets

Published: 23 Jun 2022

Field: Machine Learning

Paper link: https://arxiv.org/abs/2206.11925

Code: https://github.com/rajesh-lab/deep_permutation_invariant

Authors: Lily H. Zhang, Veronica Tozzo, John M. Higgins, Rajesh Ranganath

Summary: However, we show that existing permutation invariant architectures, Deep Sets and Set Transformer, can suffer from vanishing or exploding gradients when they are deep.

Abstract: Permutation invariant neural networks are a promising tool for making predictions from sets. However, we show that existing permutation invariant architectures, Deep Sets and Set Transformer, can suffer from vanishing or exploding gradients when they are deep. Additionally, layer norm, the normalization of choice in Set Transformer, can hurt performance by removing information useful for prediction. To address these issues, we introduce the clean path principle for equivariant residual connections and develop set norm, a normalization tailored for sets. With these, we build Deep Sets++ and Set Transformer++, models that reach high depths with comparable or better performance than their original counterparts on a diverse suite of tasks. We additionally introduce Flow-RBC, a new single-cell dataset and real-world application of permutation invariant prediction. We open-source our data and code here: https://github.com/rajesh-lab/deep_permutation_invariant.
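The two ideas named in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation, and it rests on two assumptions about the paper: that set norm standardizes each set with a single mean and variance taken over all of its elements and features before applying a per-feature scale and shift, and that a "clean path" block keeps the skip connection untouched by normalizing inside the residual branch. All function names are hypothetical, and the per-element function is a toy stand-in for a learned layer.

```python
import math


def set_norm(x, gamma, beta, eps=1e-5):
    """Standardize one set x (M elements, each a list of D features).

    Assumption: one mean and one variance are computed over all M*D entries,
    followed by a per-feature scale (gamma) and shift (beta).
    """
    values = [v for row in x for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var + eps)
    return [
        [(v - mean) / std * g + b for v, g, b in zip(row, gamma, beta)]
        for row in x
    ]


def elementwise_layer(row):
    """Toy per-element function (hypothetical stand-in for a learned layer)."""
    return [max(0.0, 2.0 * v - 0.5) for v in row]


def clean_path_block(x, gamma, beta):
    """Residual block with an untouched skip path: x + f(set_norm(x))."""
    normed = set_norm(x, gamma, beta)
    return [
        [xi + fi for xi, fi in zip(row, elementwise_layer(normed_row))]
        for row, normed_row in zip(x, normed)
    ]


# Permuting the set elements should permute the block output the same way.
x = [[1.0, 2.0], [3.0, -1.0], [0.0, 4.0]]
gamma, beta = [1.0, 1.0], [0.0, 0.0]
perm = [2, 0, 1]
out = clean_path_block(x, gamma, beta)
out_perm = clean_path_block([x[i] for i in perm], gamma, beta)
```

Because the statistics are pooled over the whole set and the toy layer acts on each element independently, reordering the input only reorders the output, which is the equivariance property the residual blocks need to preserve.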

Paper: DDPM-CD: Remote Sensing Change Detection using Denoising Diffusion Probabilistic Models

Title: DDPM-CD: Remote Sensing Change Detection using Denoising Diffusion Probabilistic Models

Published: 23 Jun 2022

Field: Computer Vision

Tasks: Change Detection, Denoising

Paper link: https://arxiv.org/abs/2206.11892

Code: https://github.com/wgcban/ddpm-cd

Authors: Wele Gedara Chaminda Bandara, Nithin Gopalakrishnan Nair, Vishal M. Patel

Summary: Human civilization has an increasingly powerful influence on the earth system, and earth observations are an invaluable tool for assessing and mitigating the negative impacts.

Abstract: Human civilization has an increasingly powerful influence on the earth system, and earth observations are an invaluable tool for assessing and mitigating the negative impacts. To this end, observing precisely defined changes on Earth's surface is essential, and we propose an effective way to achieve this goal. Notably, our change detection (CD)/segmentation method proposes a novel way to incorporate the millions of off-the-shelf, unlabeled, remote sensing images available through different earth observation programs into the training process through denoising diffusion probabilistic models. We first leverage the information from these off-the-shelf, uncurated, and unlabeled remote sensing images by using a pre-trained denoising diffusion probabilistic model and then employ the multi-scale feature representations from the diffusion model decoder to train a lightweight CD classifier to detect precise changes. The experiments performed on four publicly available CD datasets show that the proposed approach achieves remarkably better results than the state-of-the-art methods in F1, IoU, and overall accuracy. Code and pre-trained models are available at: https://github.com/wgcban/ddpm-cd
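The three metrics named in the abstract (F1, IoU, and overall accuracy) can all be derived from the confusion counts between a predicted and a reference binary change map. The following is a minimal sketch, assuming masks are nested lists of 0/1 where 1 marks a changed pixel; the function name and the toy masks are illustrative and are not taken from the DDPM-CD code base.

```python
def change_detection_metrics(pred, target):
    """Compute F1, IoU and overall accuracy for the 'changed' class."""
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == 1 and t == 1:
                tp += 1  # change correctly detected
            elif p == 1 and t == 0:
                fp += 1  # false alarm
            elif p == 0 and t == 1:
                fn += 1  # missed change
            else:
                tn += 1  # unchanged pixel correctly left alone
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    overall_accuracy = (tp + tn) / total if total else 0.0
    return {"f1": f1, "iou": iou, "overall_accuracy": overall_accuracy}


# Toy 2x4 change maps (illustrative values only).
pred = [[1, 1, 0, 0], [0, 1, 0, 0]]
target = [[1, 0, 0, 0], [0, 1, 1, 0]]
metrics = change_detection_metrics(pred, target)
```

Overall accuracy counts unchanged pixels as successes, so it tends to look high when changes are rare; F1 and IoU ignore true negatives, which is one reason change detection papers usually report all three.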

Paper: DeepNet: Scaling Transformers to 1,000 Layers

Title: DeepNet: Scaling Transformers to 1,000 Layers

Published: 1 Mar 2022

Field: Natural Language Processing

Paper link: https://arxiv.org/abs/2203.00555

Code: https://github.com/microsoft/unilm, https://github.com/labmlai/annotated_deep_learning_paper_implementations, https://github.com/facebookresearch/xformers, https://github.com/lucidrains/RETRO-pytorch

Authors: Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei

Summary: In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers.

Abstract: In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in Transformer, accompanying with theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., good performance of Post-LN and stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sublayers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction.
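The abstract does not spell out DeepNorm itself. As a minimal sketch, assuming the form given in the DeepNet paper (the residual input is scaled up by a constant alpha before a Post-LN style layer normalization, with alpha = (2N)^(1/4) and an initialization gain beta = (8N)^(-1/4) for an encoder-only model of N layers), the update might look as follows. The function names and the toy sublayer are hypothetical, and the learned LayerNorm scale and shift are omitted for brevity.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def deepnorm_constants(num_layers):
    """Encoder-only constants as assumed above.

    alpha scales the residual branch input; beta is meant as a gain applied
    when initializing some sublayer weights (not exercised in this sketch).
    """
    alpha = (2 * num_layers) ** 0.25
    beta = (8 * num_layers) ** -0.25
    return alpha, beta


def deepnorm_residual(x, sublayer, alpha):
    """DeepNorm-style residual update: LayerNorm(alpha * x + sublayer(x))."""
    fx = sublayer(x)
    return layer_norm([alpha * xi + fi for xi, fi in zip(x, fx)])


def toy_sublayer(x):
    """Hypothetical stand-in for an attention or feed-forward sublayer."""
    return [0.5 * v for v in reversed(x)]


alpha, beta = deepnorm_constants(num_layers=1000)
x = [1.0, -2.0, 0.5, 3.0]
y_deepnorm = deepnorm_residual(x, toy_sublayer, alpha)
# With alpha = 1 the same function reduces to the ordinary Post-LN residual.
y_post_ln = deepnorm_residual(x, toy_sublayer, alpha=1.0)
```

With alpha greater than 1, the skip path dominates the sum, so each sublayer contributes a smaller relative change per layer; this is the intuition behind the bounded model updates mentioned in the abstract.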

Paper: Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis

Title: Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis

Published: CVPR 2022

Field: Speech

Tasks: Speech Enhancement

Paper link: https://arxiv.org/abs/2203.17263

Code: https://github.com/facebookresearch/facestar

Authors: Karren Yang, Dejan Markovic, Steven Krenn, Vasu Agrawal, Alexander Richard

Summary: Since facial actions such as lip movements contain significant information about speech content, it is not surprising that audio-visual speech enhancement methods are more accurate than their audio-only counterparts.

Abstract: Since facial actions such as lip movements contain significant information about speech content, it is not surprising that audio-visual speech enhancement methods are more accurate than their audio-only counterparts. Yet, state-of-the-art approaches still struggle to generate clean, realistic speech without noise artifacts and unnatural distortions in challenging acoustic environments. In this paper, we propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR. Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals. Given the importance of speaker-specific cues in speech, we focus on developing personalized models that work well for individual speakers. We demonstrate the efficacy of our approach on a new audio-visual speech dataset collected in an unconstrained, large vocabulary setting, as well as existing audio-visual datasets, outperforming speech enhancement baselines on both quantitative metrics and human evaluation studies. Please see the supplemental video for qualitative results at https://github.com/facebookresearch/facestar/releases/download/paper_materials/video.mp4.

Paper: Patches Are All You Need?

Title: Patches Are All You Need?

Published: 24 Jan 2022

Field: Computer Vision

Tasks: Image Classification

Paper link: https://arxiv.org/abs/2201.09792

Code: https://github.com/tmp-iclr/convmixer, https://github.com/locuslab/convmixer, https://github.com/labmlai/annotated_deep_learning_paper_implementations/tree/master/labml_nn/conv_mixer, https://github.com/BR-IDL/PaddleViT/tree/develop/image_classification, https://github.com/martinsbruveris/tensorflow-image-models

Authors: Asher Trockman, J. Zico Kolter

Summary: Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet.

Abstract: Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet. Our code is available at https://github.com/locuslab/convmixer.
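The abstract's point about patches and quadratic attention cost can be made concrete with a little arithmetic. The sketch below groups a single-channel image into non-overlapping square patches and compares how many token pairs self-attention would have to consider with and without patching. The 224x224 image size and patch size 16 are illustrative assumptions, the function names are hypothetical, and the learned projection that a real patch embedding applies to each patch is left out.

```python
def split_into_patches(image, patch_size):
    """Group an H x W image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ])
    return patches


def attention_pair_count(height, width, patch_size):
    """Return (number of tokens, number of token pairs) for self-attention."""
    tokens = (height // patch_size) * (width // patch_size)
    return tokens, tokens * tokens


# A 4x4 toy image split into four 2x2 patches.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, 2)

# One token per pixel versus one token per 16x16 patch (assumed sizes).
pixel_tokens, pixel_pairs = attention_pair_count(224, 224, 1)
patch_tokens, patch_pairs = attention_pair_count(224, 224, 16)
reduction = pixel_pairs // patch_pairs
```

Since the pair count grows with the square of the token count, a patch of side p cuts the attention work by a factor of p to the fourth power, which is why patching matters so much for larger images.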

We are ShowMeAI, dedicated to spreading high-quality AI content, sharing industry solutions, and using knowledge to accelerate every step of technical growth!
