Why Spotify likes being locked into Google Cloud
Author: 糖果
Published: 2025-04-29 06:47:28 (11 days ago)
Source: protocol.com/enterprise/
Some companies are wary about using a single cloud vendor, or using managed services that can be hard to quit. Spotify has made a big bet in the other direction.

By Tom Krazit
May 3, 2021

Count Spotify among the businesses that are quite happy to be tied down to a single cloud provider. Arguably one of Google Cloud's highest-profile customers, Spotify began its migration to Google five years ago, when Google's long-term commitment to cloud computing was something of an open question. Ever since then, the music and podcast service has doubled down on Google's infrastructure, building around higher-level services that trade easy portability for convenience and ease of use. And that's just fine with Tyson Singer, vice president of technology and platforms at Spotify, who oversees the technical infrastructure that serves Spotify's 356 million monthly active users.

The 2,000-plus developers and tech professionals at Spotify also have a secret weapon called Backstage, an internally developed management console that lets developers use the dozens of tools in Spotify's arsenal through a consistent user interface. Backstage is available as an open-source project through the Cloud Native Computing Foundation.

In a recent interview with Protocol, Singer discussed the company's decision to marry its fortunes to Google Cloud, the pros and cons of using managed services, and why "ML ops" is the next big thing on his radar. This interview has been edited and condensed for clarity.
A few years ago, Spotify made a pretty substantial migration to Google Cloud. Where do things stand at this point?

We are all in on GCP. That was a really intentional approach that we took a number of years ago to get ourselves out of the commodity [infrastructure management] job, and all of the attention it was taking from our organization, to focus on higher-level things. And we did it, I think, a little bit differently than a lot of companies you see.

So if you were to compare us to Netflix, what we did is we went all in on a single vendor, but we also went all in on these high-level managed services. That was an approach that doubled down on this whole philosophy that we wanted to spend more time focused on our business, and less time on infrastructure. That's an interesting thing for somebody who leads the infrastructure organization to say, but it's something that I actually truly believe in.

The other driver was really just speed. It's an organization that is oriented around speed. And in my organization, our mantra is that we're enabling speed and scale, and doing it safely, for basically every Spotifier and all of our products.

Why Google?

We did the usual sort of due diligence that everybody does when looking at a cloud vendor. But what really stood out for Google was a few things. One, they were leading on the data side.
And we realized, based on the amount of data that we were ingesting, that we needed a partner who could handle complexity and scale and data, and get us beyond the limits we were hitting. We had the largest Hadoop cluster running in Europe at the time, but it was still quite constrained for our organization.

And then second, we needed a partner that fit with us culturally, and that we felt like we could really influence compared to some of the other possibilities at that time. Google definitely hit both of those criteria because they were new entrants, and they had a lot of the same sort of cultural aspects that we had around autonomy and independence in our engineering team, and a real focus on engineering excellence.

It's funny, because I've heard, even from people at Google, that one of the things they've struggled with is trying to empathize with their customer, trying to understand that not every customer needs a Google-scale approach to what they do. But it almost sounds like for you that's what was appealing.

It was and it wasn't. We did have conflicts at times, as in any good partnership. They've been on a learning journey of how to have empathy for customers, and what their specific requirements are that might not be the Google Way, and understanding that there are these other sorts of amazing engineering organizations that may do things differently, and those make sense.

So it was a good journey.
And since we were quite a large account of theirs, we were able to help them go on that journey as well.

I wanted to circle back on the managed services question, which I think is really interesting. I interviewed Mike McNamara from Target a month ago, and he was the complete opposite: I don't want any managed services, I want to run everything ourselves, we understand that we have to invest in people and skills in order to do that, but we want that flexibility. Can you talk a little bit about your philosophy behind that? The pros and cons of building around managed services that really put you in bed with Google for a very long time?

There's a part of the story that has changed recently. Going back to a couple things that I said before: we, first and foremost, are optimizing for speed. And secondly, we're constrained in our data ecosystem. So when you look at a product like BigQuery, and the accessibility and the scalability of that, and also the complexity of building something like that yourself, there's a huge appeal in that.

We actually tracked the usage of BigQuery across the company, relative to our previous context, and the number of employees, especially technically savvy employees who didn't have the data skills, and we just saw this crazy exponential curve of adoption of that sort of technology. So that's like, "Alright, yes, we're extracting business insight at such a faster rate." And that was what was most important for us at the time.
However, as we adopted more of those managed services, questions of efficiency started to come in. From the perspective of the customers that I look after, other Spotifiers, we want them to have that level of abstraction, so they don't have to get down into the nitty-gritty details of understanding infrastructure and can really be abstracted from that; we are going through and layering in our own managed services so that we can get better scales of efficiency there and better, basically, unit cost.

Just so I'm clear on that, you're sort of building your own Spotify-managed services on top of, let's say, vanilla GCP? Rather than using some of the Google managed services?

Yeah, in very targeted fashion. We're not doing it across the board. We're just doing it where we think it really has an impact on the effectiveness of our overall budget and spend.

Can you talk a little bit about what some of those targeted areas are for you?

Data processing is one of those areas that we've taken a look at and are in the process of [building]. We're very transparent with Google about this. There are some other services that I don't want shared publicly that we're working on as well. But [data processing is] the one that's probably the most visible, because we're also doing it in the open-source arena.
One of the things that we've also done is going from being completely optimized for speed to saying now we might have a little bit of extra fat in the organization, and we need to trim back how we spend on the cloud. It's been a journey to really change an engineering culture and mindset that was focused on a lot of important things around performance and scalability, reliability, observability: all those things that engineers love to work on, but they weren't focused on cost.

So we leveraged a tool that we've spent a lot of time on and that our engineers love and adore, our development portal called Backstage, to add a plugin into that ecosystem. It is a cost-insight plugin that allowed us to take a step forward in our cloud evolution, so that more and more engineers could understand the implications of their engineering decisions for the company's bottom line in a context that was meaningful for them.

Backstage looks like it could support a multicloud environment. Did you have that in mind when you built it?

We really want Backstage to succeed because it's so integral to how our company operates. It's the single pane of glass that developers, data scientists, sometimes even designers look at to do their jobs: to build out the software, to manage the software, to create their software [and] to find new software. It doesn't matter if it's data, like a new data pipeline, a new back-end service or a new feature on mobile, a new machine-learning feature; it's all inside of this context.
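The cost-insight idea Singer describes, attributing cloud spend back to the teams whose engineering decisions incur it, can be sketched in miniature. The snippet below is purely illustrative: the row schema, field names and figures are hypothetical, and Spotify's actual plugin (part of the open-source Backstage ecosystem) does far more than this.

```python
from collections import defaultdict

def cost_by_team(billing_rows):
    """Aggregate cloud spend per (team, service) pair.

    billing_rows: an iterable of dicts with hypothetical keys
    'team', 'service' and 'cost_usd'; a real billing export has
    a much richer schema, these names are just for illustration.
    """
    totals = defaultdict(float)
    for row in billing_rows:
        totals[(row["team"], row["service"])] += row["cost_usd"]
    return dict(totals)

# Hypothetical billing-export rows.
rows = [
    {"team": "playback", "service": "BigQuery", "cost_usd": 120.0},
    {"team": "playback", "service": "GCS", "cost_usd": 30.5},
    {"team": "search", "service": "BigQuery", "cost_usd": 80.0},
    {"team": "playback", "service": "BigQuery", "cost_usd": 60.0},
]
print(cost_by_team(rows))
```

A dashboard built on an aggregation like this is what lets engineers see cost in a context that is meaningful to them: their own team and services, rather than a company-wide bill.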
Because this is so central to how we do development, we want to share it with the world. We want it to win as the developer portal out there. And therefore, it has to work on more than just GCP; it has to work on AWS, it has to work on [Microsoft] Azure.

But in terms of Spotify, you don't seem really keen on setting up multicloud yourself?

No, not super keen. There's simplicity in having a single cloud, and that saves us a lot of hassle and complexity.

Which emerging enterprise technologies are you most excited about, or which ones do you think could have the most impact on Spotify?

One of the areas where we've been investing for a while has been in [machine learning] ops, or ML infrastructure, and stitching together all of the different parts of our solution there. I'm seeing more companies enter this area, which I still feel is not a well-served area in the overall marketplace. It's not well served by the cloud providers, which don't really stitch together something that supports the full lifecycle. And so I think we're quite close to having it stitched together. But I see a lot of activity in that, and that's actually pretty exciting.

How would you define ML ops?
Generally, the challenge is going from the training of models to the runtime ecosystem, and ensuring that you can do all the standard software development practices that we're all used to in all the other disciplines: being able to do CI/CD-type activities, and to iterate on the experimentation that you're doing in those ecosystems.

With our infrastructure, there'll be a new model created on a frequent basis. And then we run that through our experimentation platform, and see, "Did that actually move metrics?" Being able to organize all of that and keep it as something that's sustained, and [helping] people who've joined Spotify, who are currently bogged down, to really focus on amazing ML research: that's where I got to see all the different aspects of ML ops.
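The "did that actually move metrics?" check Singer mentions can be caricatured as a significance gate that an ML-ops pipeline runs after each experiment. This is a deliberately minimal sketch under stated assumptions: it takes per-user metric samples for a control and a treatment group and applies a two-sample z-test, and the function name, sample data and 1.96 threshold are all illustrative rather than anything Spotify's platform actually does.

```python
from math import sqrt
from statistics import mean, stdev

def moved_metric(control, treatment, z_threshold=1.96):
    """Crude experiment gate: did the candidate model move the metric?

    control, treatment: per-user metric values (hypothetical).
    Returns (significant, z): whether |z| cleared the threshold,
    and the z-score itself.
    """
    se = sqrt(stdev(control) ** 2 / len(control)
              + stdev(treatment) ** 2 / len(treatment))
    z = (mean(treatment) - mean(control)) / se
    return abs(z) >= z_threshold, z

# Hypothetical per-user engagement rates from one experiment.
control = [0.40, 0.42, 0.39, 0.41, 0.40, 0.43, 0.38, 0.41]
treatment = [0.47, 0.49, 0.46, 0.48, 0.50, 0.47, 0.45, 0.48]
significant, z = moved_metric(control, treatment)
print(significant, round(z, 2))
```

In a CI/CD-style ML pipeline, a gate like this would sit between model training and promotion: a freshly trained model only replaces the serving model if the experiment cleared the bar.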