Very Large Scale Music Understanding
BRIAN WHITMAN
The Echo Nest Corporation
Scientists and engineers around the world have been attempting to do the impossible—and yet, no one can question their motives. When spelled out, “understanding music” by a computational process just feels offensive. How can music, something so personal, something rooted in context, culture, and emotion, ever be labeled by an autonomous process? Even an ethnographical approach—surveys, interviews, manual annotation—undermines the raw effort of musical artists, who will never understand, or even, perhaps, take advantage of what might be learned or created through this research. Music by its very nature resists analysis.
In the past 10 years, I’ve led two lives—one as a “very long-tail” musician and artist and the other as a scientist turned entrepreneur who currently sells “music intelligence” data and software to almost every major music-streaming service, social network, and record label. How we got from one to the other is less interesting than what it might mean for the future of expression and what I believe machine perception can actually accomplish.
In 1999, I moved to New York City to begin graduate studies at Columbia working on a large “digital government” grant parsing decades of military documents to extract the meaning of acronyms and domain-specific words. At night I would swap the laptops in my bag and head downtown to perform electronic music at various bars and clubs.
As hard as I tried to keep my two lives separate, the walls between them quickly came down when I began to ask my fellow performers and audience members how they learn about music. They responded, “We read websites,” “I’m on a discussion board,” “A friend e-mailed me some songs,” and so on. Clearly, alongside the media frenzy over peer-to-peer networks (Napster was just ramping up), a real movement in music discovery was underway. Technology had been helping us acquire and make music, but all of a sudden it was being used to communicate and learn about it as well.
With the power to communicate with millions and the seemingly limitless potential of bandwidth and attention, even someone like me could be noticed. So, suitably armed with a background in information retrieval and an almost criminal naiveté about machine learning and signal processing, I quit my degree program and began to concentrate full time on the practice of what is now known as “music information retrieval.”
MUSIC INFORMATION RETRIEVAL
The fundamentals of music information retrieval derive from text retrieval. In both cases, you are faced with a corpus of unstructured data. For music, these include time-domain samples from audio files and score data from the compositions. Tasks normally involve extracting readable features from the input and then developing a model from the features. In fact, music data are so unstructured that most music-retrieval tasks began as blind “roulette wheel” predictions: “Is this audio file rock or classical?” (Tzanetakis and Cook, 2002) or “Does this song sound like this one?” (Foote, 1997).
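The two-step pattern described above (extract features from time-domain samples, then fit a model to the features) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two features (zero-crossing rate and RMS energy), the nearest-centroid model, the synthetic sine-wave "clips," and the labels "low" and "high" stand in for real audio and real genres. It is not the feature set or classifier of Tzanetakis and Cook (2002).

```python
import math

def extract_features(samples):
    """Return (zero-crossing rate, RMS energy) for a list of float samples."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    zcr = crossings / (len(samples) - 1)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return (zcr, rms)

def train_centroids(labeled_clips):
    """Average the feature vectors of each label: {label: centroid}."""
    sums, counts = {}, {}
    for label, samples in labeled_clips:
        feats = extract_features(samples)
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(v / counts[label] for v in acc)
        for label, acc in sums.items()
    }

def predict(centroids, samples):
    """Pick the label whose centroid is closest to the clip's features."""
    feats = extract_features(samples)
    return min(centroids, key=lambda label: math.dist(feats, centroids[label]))

def sine(freq_hz, n=2000, rate=8000, amp=0.5):
    """Synthetic stand-in for an audio clip (hypothetical test signal)."""
    return [amp * math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n)]

# Toy training set: the labels are placeholders, not musical genres.
training = [
    ("low", sine(110)), ("low", sine(220)),
    ("high", sine(1760)), ("high", sine(2200)),
]
centroids = train_centroids(training)
print(predict(centroids, sine(150)))
print(predict(centroids, sine(1900)))
```

The sketch makes the point of the paragraph concrete: the model only ever sees a handful of numbers derived from the signal, and the label it returns says nothing about how a listener would respond.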
The seductive notion that a black box of some complex nature (usually with a hopeful success story embedded in its name [e.g., neural networks, Bayesian belief networks, support vector machines]) might untangle a mess of audio stimuli in a way that approaches the response of our nervous and perceptual systems is intimidating enough. That problem is so complex and so hard to evaluate that it distracts researchers from the much more serious elephantine presence of the emotional connection that underlies the data.
Suppose the science of music retrieval were rocked by a massive advance in signal processing or machine learning that solved the problem of label prediction, so that we could predict the genre of a song with 100 percent accuracy. The question is what that would do for the musician and what it would do for the listener. If I knew a song I hadn’t heard yet was predicted to be “jazz” by a computer, this might save me the effort of looking up information on the artist, who probably spent years defining his/her expression in terms of or despite these categories. But the jazz label doesn’t tell me anything about the music, about what I’ll feel when I hear it, about how I’ll respond or how it will resonate with me individually or in the global community. In short, we would have built a black box that could neatly delineate other black boxes but was of no benefit to the very human world of music.
The way out of this feedback loop is to somehow automatically understand reaction and context the same way we do when we actually perceive music. The ultimate contextual-understanding system would be able to gauge my personal reaction and mindset to a piece of music. It would not only know my history and my influences, but would also understand the larger culture around the musical content.
We are all familiar with the earliest approaches to contextual understanding of music, namely collaborative filtering, a.k.a. “people who buy this also buy this” (Shardanand and Maes, 1995), and we are just as familiar with its pitfalls. Sales- or activity-based recommenders only know about you in relation to others; their meaning of your music is not what you like but what you’ve shared with an anonymous hive. The weakness of these filtering approaches becomes apparent when you talk to engaged listeners: “I always see the same bands,” “There’s never any new stuff,” or “This thing doesn’t know me.”
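A minimal sketch of the “people who buy this also buy this” idea, and of its blind spot, follows. It assumes the simplest possible variant, counting how often two items appear in the same purchase basket; this is not the algorithm of Shardanand and Maes (1995), and the basket data and band names are hypothetical.

```python
from collections import Counter
from itertools import permutations

def build_cooccurrence(baskets):
    """Count, for each item, how often every other item shares a basket with it."""
    co = {}
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            co.setdefault(a, Counter())[b] += 1
    return co

def also_bought(co, item, k=3):
    """Return up to k items most often bought together with `item`."""
    return [other for other, _ in co.get(item, Counter()).most_common(k)]

# Hypothetical purchase histories, one list per listener.
baskets = [
    ["band_a", "band_b"],
    ["band_a", "band_b", "band_c"],
    ["band_a", "band_c"],
    ["band_b", "band_a"],
]
co = build_cooccurrence(baskets)
print(also_bought(co, "band_a"))    # ['band_b', 'band_c']
print(also_bought(co, "new_band"))  # []
```

The empty result for `new_band` is the pitfall in miniature: an artist with no sales or activity data has no co-occurrence counts, so a recommender of this kind can never surface that artist.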
My reaction to the senselessness of filtering approaches was to return to school and begin applying my language-processing background to music—reading about music and not just trying to listen to it. The idea was that, if we could somehow approximate even 1 percent of the data that communities generate about music on the Internet—they review it, they argue about it on forums, they post about shows on their blogs, they trade songs on peer-to-peer networks—we could begin to model large-scale cultural reactions (Whitman, 2005). Thus, in a world of music activity, we would be able to autonomously and anonymously find a new band, for example, that collaborative filtering would never touch (because there are not enough sales data yet) and acoustic filtering would never “get” (because what makes them special is their background or their fan base or something else impossible to calculate from the signal).
THE ECHO NEST
With my co-founder, whose expertise is in musical approaches to signal analysis (Jehan, 2005), I left the academic world to start a private enterprise, “The Echo Nest.” We now have 30 people, a few hundred computers, one and a half million artists, and more than ten million songs. Our biggest challenge has been the very large scale of the data. Each artist has an Internet footprint that averages thousands of blog posts, reviews, and forum discussions, all in different languages. Each song is composed of thousands of “indexable” events, and the song itself might be duplicated thousands of times in different encodings. Most of our engineering work involves dealing with this huge amount of data. Although we are not an infrastructure company, we have built many unique data storage and indexing technologies as a byproduct of our work.
We began the company with the stated goal of indexing everything about music. Over the past five years, we have built a series of products and technologies that take the best and most practical aspects of our music-retrieval dissertations and package them cleanly for our customers. The data we collect are necessarily unique. Instead of storing data on relationships between musicians and listeners, or only on popular music, we compute and aggregate a sort of Internet-scale cache of all possible points of information about a song, artist, release, listener, or event. We sell a music-similarity system that compares two songs based on their acoustic and cultural properties. We provide automatically generated data on tempo, key, and timbre to mobile applications and streaming services. We track artists’ “buzz” on the Internet and sell reports to labels and managers.
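To make the idea of comparing two songs on both acoustic and cultural properties concrete, here is a rough sketch under stated assumptions. The text does not describe how the actual system works, so everything below is hypothetical: the song records, the acoustic feature names and their scaling to the range 0 to 1, the bag-of-words treatment of text written about a song, and the equal-weight blend of two cosine similarities.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def term_vector(texts):
    """Word counts over everything written about a song (toy tokenizer)."""
    return Counter(word for text in texts for word in text.lower().split())

def blended_similarity(song_a, song_b, acoustic_weight=0.5):
    """Weighted mix of acoustic and cultural similarity (assumed scheme)."""
    acoustic = cosine(song_a["acoustic"], song_b["acoustic"])
    cultural = cosine(term_vector(song_a["texts"]), term_vector(song_b["texts"]))
    return acoustic_weight * acoustic + (1 - acoustic_weight) * cultural

# Hypothetical songs: acoustic values are assumed to be pre-scaled to 0..1.
song_a = {"acoustic": {"tempo": 0.60, "brightness": 0.80},
          "texts": ["dreamy lo-fi bedroom pop", "dreamy guitars"]}
song_b = {"acoustic": {"tempo": 0.62, "brightness": 0.75},
          "texts": ["bedroom pop with dreamy vocals"]}
song_c = {"acoustic": {"tempo": 0.90, "brightness": 0.10},
          "texts": ["aggressive industrial techno"]}

print(blended_similarity(song_a, song_b))
print(blended_similarity(song_a, song_c))
```

In this toy data, `song_a` scores closer to `song_b` than to `song_c` because the pair shares both similar signal features and overlapping vocabulary in what people wrote about them. The `acoustic_weight` parameter is a design choice: setting it to 1.0 reduces the comparison to the signal alone, and 0.0 to the text alone.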
The heart of The Echo Nest remains true to our original idea. We strongly believe in the power of data to enable new music experiences. Because we crawl and index everything, we can level the playing field for all types of musicians by taking advantage of the information provided to us by any community on the Internet. Work in music retrieval and understanding requires a sort of wide-eyed passion combined with a large dose of reality. The computer is never going to fully understand what music is about, but if we sample from the right sources often enough and on a large enough scale, the only thing standing in our way is a leap of faith by listeners.
REFERENCES
Foote, J.T. 1997. Content Based Retrieval of Music and Audio. Pp. 138–147 in Multimedia Storage and Archiving Systems II, edited by C.-C.J. Kuo, S-F. Chang, and V. Gudivada. Proceedings of SPIE, Vol. 3229. New York: IEEE.
Jehan, T. 2005. Creating Music by Listening. Dissertation, School of Architecture and Planning, Program in Media Arts and Sciences, Massachusetts Institute of Technology.
Shardanand, U., and P. Maes. 1995. Social Information Filtering: Algorithms for Automating ‘Word of Mouth.’ Pp. 210–217 in Proceedings of ACM (CHI)’95 Conference on Human Factors in Computing Systems. Vol. 1. New York: ACM Press.
Tzanetakis, G., and P. Cook. 2002. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5): 293–302.
Whitman, B. 2005. Learning the Meaning of Music. Dissertation, School of Architecture and Planning, Program in Media Arts and Sciences, Massachusetts Institute of Technology.