Fine-tuning OpenAI's GPT-2 to generate fake DJ biographies
For fun, I wanted to see if I could fine-tune OpenAI’s GPT-2 model to generate DJ and electronic musician biographies.
It’s only my opinion, but it performed impressively. With only a modest amount of training, the model’s output gets close to the source material, and even when it misses, it reads like an engaging stream-of-consciousness parody of it.
GPT-2 is still seen as the state-of-the-art for generating coherent passages of human-looking text. After it was made available in February 2019, people quickly set to work applying it to a wide range of topics. It was even recently applied to music generation.
Language generation like this reminds us how predictable human output can be. Look at the PoMo text generator for an example of how a particular field can fall into a formula that becomes easy to spoof.
DJ biographies are the same to some degree. Artists often put a lot of effort into writing them as part of projecting their professional brand and marketing themselves. These bios tend to follow characteristic patterns, with recurring language and structure. They range from professionally written to hastily composed by a non-native speaker, and as such offer an interesting training set for fine-tuning GPT-2, which takes influences from all of them and creates a composite.
Here is a biography I generated with GPT-2:
Auraum is a new project by Alessio Viggiano and Alessio Stumpvane, who are residents in the underground club NorthEast Rome. They have decided to combine their talents to offer a new vision of the music. Something more underground, more raw and more experimental, aiming to create a unique sound. Auraum strives to reach the right balance between the dance floor and the lounge.
I doubt I’d be able to spot this as a fake if it were placed among real ones. ‘Auraum’ does not appear in the training data, but GPT-2 has nonetheless decided it sounds appropriate (and I concur). A real-world producer, Alessio Viggiano, was then paired with another GPT-2 invention, ‘Alessio Stumpvane’, before the model conjured up the rest of the biography.
To me, it’s impressive that a machine can create this from a pre-existing model and a small amount of subsequent training on new material. I’ve shared more examples later on. I’ve also linked to the fine-tuned model, along with the Google Colab notebook, so that you can generate your own.
Setting up GPT-2
GPT-2 is a pre-trained model built for generating text. Transfer learning can be applied to fine-tune it for a particular domain, which involves giving it representative data to update the parameters of some parts of the architecture while keeping most of it the same.
A language model predicts the next word when given a sequence of preceding ones. We want transfer learning to update the likelihoods involved in that prediction so that they reflect the new training data. It’s important that the model imitates, but does not replicate, the training data: reproducing the training data verbatim is a sign of overfitting, which undermines the model’s usefulness.
The code used here for fine-tuning GPT-2 was provided by Max Woolf, who built on work by Neil Sheppard to produce gpt-2-simple and was kind enough to put it in a notebook on Google Colab, a cloud-based service that provides free access to GPUs.
OpenAI released three versions of the GPT-2 model: “small” (124M parameters, 500MB on disk), “medium” (355M parameters, 1.5GB on disk), and “large” (774M parameters, 3GB on disk). The larger the model, the more sophisticated the output. Sadly, the largest model cannot currently be fine-tuned using Google Colab, since the server GPUs run out of memory during training.
It’s worth keeping in mind that the results here are based on the medium model and would likely be better with the largest one.
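For reference, the setup inside Woolf’s Colab notebook boils down to a few gpt-2-simple calls. Below is a minimal sketch of those setup cells (the notebook handles this for you, possibly with slightly different options):

```python
# Minimal gpt-2-simple setup inside a Colab notebook
!pip install -q gpt-2-simple

import gpt_2_simple as gpt2

# Download the pre-trained "medium" checkpoint (~1.5GB) into ./models/355M
gpt2.download_gpt2(model_name="355M")

# Mount Google Drive so checkpoints survive the Colab session
gpt2.mount_gdrive()
```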
Getting data
As a general rule, the larger the model, the more data is needed; Woolf recommends using the small 124M model when less than 10MB of data is available. Scraping for training data is a legally grey area. However, a strong argument can be made that AI-generated content is a transformative derivative work, akin to a human reading everything and synthesising something new from their understanding of the source material. It is not a reproduction of the original material but rather a genuinely creative process built on an initial step of learning and comprehension (overfitting notwithstanding).
I put together a Python script using the Beautiful Soup library to collect and parse the HTML for each featured artist, running at a very slow pace to ensure the source website was unaffected by the process. The self-imposed speed restriction meant that it took several days to collect the complete raw corpus of text.
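I haven’t published the scraper itself, but the approach is straightforward. Below is a minimal sketch of the idea; the base URL, artist slugs, CSS selector, and delay value are hypothetical placeholders rather than the real ones I used.

```python
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/artists/"  # hypothetical source site
DELAY_SECONDS = 30                         # deliberately slow, to avoid loading the site

def fetch_bio(artist_slug):
    """Download one artist page and extract the biography text."""
    resp = requests.get(BASE_URL + artist_slug, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    bio_div = soup.select_one("div.artist-bio")  # hypothetical selector
    return bio_div.get_text(" ", strip=True) if bio_div else None

bios = []
for slug in ["artist-one", "artist-two"]:  # in practice, a long list of scraped slugs
    bio = fetch_bio(slug)
    if bio:
        bios.append(bio)
    time.sleep(DELAY_SECONDS)              # self-imposed rate limit
```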
Because GPT-2 has been trained to generate English content only, I used the Spacy.io langdetect module to classify each text’s language and filtered out anything unlikely to be English. I also added delimiters to the beginning and end of each biography (<|startoftext|> and <|endoftext|>) to clearly disambiguate them for the gpt-2-simple training process. This resulted in ~52,000 biographies for fine-tuning, at about 74MB in total.
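As a rough sketch of that filtering step, here is how it could be done with the standalone langdetect library (the package that the spaCy wrapper builds on); the bios list and output file name carry on from the scraping sketch above and are placeholders.

```python
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make langdetect deterministic between runs

def is_probably_english(text):
    try:
        return detect(text) == "en"
    except Exception:  # langdetect raises on empty or undetectable strings
        return False

# 'bios' is the list of raw biography strings collected by the scraper
english_bios = [b for b in bios if is_probably_english(b)]

# Wrap each bio in the delimiters that gpt-2-simple uses to separate documents
with open("dj_bios.txt", "w", encoding="utf-8") as f:
    for bio in english_bios:
        f.write("<|startoftext|>\n" + bio + "\n<|endoftext|>\n")
```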
Fine-tuning the model
I initially ran the gpt-2-simple script on the 124M model with a sample of the data, to check that it worked and that the output made sense. I didn’t keep any of that output to show you, but at around 12,000 training steps it looked pretty good, judging by eye. I then switched up to the medium 355M model with the same data sample, which took around 20k steps to begin to take on the appearance of a plausible biography. Finally, I moved on to training the medium model with the full data set.
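In gpt-2-simple, that final run comes down to a single call along the lines of the sketch below. The data file, step count, and sampling interval mirror what I describe here; the run name and the remaining hyperparameters are illustrative rather than my exact settings.

```python
import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()

gpt2.finetune(
    sess,
    dataset="dj_bios.txt",    # the delimited training file prepared earlier
    model_name="355M",        # the medium model
    steps=50000,              # where I eventually stopped training
    restore_from="fresh",     # start from the pre-trained checkpoint
    run_name="dj_bios_355M",  # illustrative run name
    sample_every=500,         # print sample output every 500 steps to judge by eye
    save_every=1000,          # checkpoint regularly in case Colab disconnects
    print_every=50,
)
```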
I checked the sample output of this final full-data run on the 355M model every 500 steps to see how it looked. It took about 20k steps before the output began to resemble a biography with any regularity, and roughly every 5k steps after that there seemed to be an incremental improvement in its sophistication.
At around 35k training steps the output began to move from total gibberish to something more like human text:
Make the best decision to save the planet by saving the human race! Every life is a let-down, every time we make a mistake, every piece of paper is a failure. Tomorrow is different and the people we honour are different. This is why we have to be strong and proud and why our best is still our strongest.
The only thing that should be in the future is to save the planet. Quoting the godfather of house music “Get right with the music, get right with the crowd”
This indicated that the fine-tuning was going in the right direction. An important check throughout training was observing the behaviour around gender pronouns. The training data appeared to have a strong bias towards he/him, which in turn creates a bias in the model. I could only judge by eye, but I looked for output that used she/her appropriately rather than defaulting to he/him in every case, and I kept training until the model began to behave better in this regard. Bias in GPT-2 is a known problem, and bias in AI more generally still has a lot of ethical and practical issues to be worked out.
I concluded training at 50k steps. In my judgment, the model produced ‘good enough’ results without obvious overfitting. Further training seemed wasteful, particularly as it appeared at the time that Colab was not entirely carbon neutral and I wanted to minimise my eco-guilt.
Results
In his post about creating a Twitter bot, Woolf estimates that only about 5% of his AI-generated tweets are good or funny. I’m generating longer texts for a different purpose and can’t really put a number on how well the model performs, but in general I’m satisfied and impressed by its output. Your mileage may vary, and I provide examples of the unfiltered output later so you can judge for yourself.
Here’s an example of the model working at its best:
Based in Berlin, Germany, Aida is an artist who is influenced by a multitude of music genres, but her passion for electronic music, in all its forms, transcends all musical genres. Aida’s DJ sets are a mix of progressive melodies, hypnotic techno, and a lot of vocals. She is inspired by a multitude of genres, but her passion for electronic music, in all its forms, transcends all musical genres.
Aida is a resident DJ at the renowned Watergate Club, and she is a regular guest at various Berlin venues such as Loftus Hall, KitKat Club, and Golden Gate.
She has performed alongside international artists such as: Dan Ghenacia, Marcus Worgull, Kas:st, and Moritz von Oswald.
She is also a resident of the Berlin based Dense & Pika collective.
For booking requests please contact: [email protected]
It follows the general style and structure of a typical bio, even including the closing protected email string. It uses the names of real Berlin venues, after it mentions the city at the beginning of the bio. It also features renowned German figures like Marcus Worgull and Moritz von Oswald. However, the rest of the bio is entirely fictional and not regurgitated from the training data. It’s worth noting that while this is an example of a bio with she/her pronouns, that isn’t the case most of the time. Overwhelmingly, the output is biased to he/him, judging by eye, mirroring the training data.
Here is another example:
Dj and producer Paze Code is based in Germany, he always brings the freshest sounds and best with him, playing in several countries like Belgium, France, the Netherlands, Switzerland, Netherlands, and Germany. Paze Code is also label boss of Le Bien Et Le Mal.
http://soundcloud.com/paze-code
http://www.facebook.com/pages/Le-Bien-Les-Maloids/2445430261306?ref=ts
The URLs, along with the name “Paze Code”, are all made up by GPT-2, and it’s impressive how it keeps track of the entities to build up the narrative. You can see that it again has trouble with repetition in lists, mentioning the Netherlands twice, but otherwise it isn’t egregious.
We can also give the model a few words as a prompt, after which it generates text in continuation. Here is what it generated for a real DJ, Stella Zekri, given as a prompt:
Stella Zekri is an underground music producer and dj based in Germany. She started her career in 2011. She started playing in a few clubs in Germany such as Baum, Butan, Muna and many more.
Stella Zekri is also an musician and she also founded the female dj duo “Skattwerk” with her friend and dj Anne Maier. Skattwerk toured all over Germany and Europe.
Stella Zekri is also one of the resident djs in the legendary club “Skull” in Berlin.
In 2010 she started producing her own music and she released her first EP on the german Label “Hippies” Leipzig.
After some years she also founded the female dj duo “Skull” and she released her first album on “Hippies” Leipzig.
After a long break she finally decided to come back on stage. In September 2018 Stella will release her new single “I Need” on the german Label “Eisenwaren”.
Stella Zekri is a member of the female dj duo “Skull” and she played in many clubs in Germany such as Baum, Butan, Muna and many more.
It certainly wouldn’t win awards for the best writing, but then I would argue that many actual English-speaking humans couldn’t better it either, not without substantial effort anyway. It raises the question of who we should compare these models to: professional writers, or humans with a wide spectrum of abilities? I’ve linked to the whole batch of bios I generated for Stella here. Notice that the model connects ‘Stella’ with she/her throughout all of the output. If you want to compare all of this to her real bio, here it is.
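For anyone wanting to reproduce this kind of prompting, gpt-2-simple exposes it through the prefix argument of its generate function. A minimal sketch, assuming the fine-tuned checkpoint was saved under the illustrative run name used earlier:

```python
import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name="dj_bios_355M")  # load the fine-tuned checkpoint

gpt2.generate(
    sess,
    run_name="dj_bios_355M",
    prefix="Stella Zekri",     # the prompt the model continues from
    truncate="<|endoftext|>",  # stop at the end-of-bio delimiter
    include_prefix=True,
    length=300,
    temperature=0.7,
    nsamples=5,
    batch_size=5,
)
```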
Next, let’s look at DJ Fart in the Club (yes, that is a real DJ):
DJ Fart In The Club (aka DJ Fart Cuts) is a House DJ from Chicago, IL. His DJ style is a fusion of Tribal, Tech House, and Techno.
DJ Fart Cuts started DJing in the early 90s in his home town of Chicago, IL. He was introduced to house music in 1996 and has been hooked ever since. His musical influences include Prince, Kenny Dope, and David Morales.
He currently plays at various Chicago venues such as Crobar, Smart Bar, Big Wig, Lava, etc. Internationally he has played at the Winter Music Conference in Miami and the Movement Festival in Detroit.
His musical style is a fusion of Tribal, Tech House, and Techno.
“I love that house music has a lot of soul. It has a lot of emotion. It’s got a lot of great elements that I just love.” -DJ Fart Cuts
You can see that it defaults the pronouns to he/him/his, whereas DJ Fart in the Club actually uses she/her, again following the bias in the training data. Here is the rest of the generated batch. In there, you can see the model has fun experimenting with the word fart, and it’s reassuring that AI can appreciate toilet humour too. There is no real bio to compare against, though. Overall, I think the generated bios are far from terrible, and if you share my sense of humour, they can be pretty funny.
Finally, I entered the DJ aptly named “I Hate Models” as a prompt, and this seemingly tipped GPT-2 over the edge into full rant mode:
I HATE MODELS.
I HATE PRODUCTIONS.
I HATE GARAGE PARTIES.
I HATE FASHION.
I HATE GARAGE BANGS.
I HATE FLASH GIGS.
I HATE GARAGE PARTIES.
I HATE HAPPY PEOPLE.
I HATE LUV.
I HATE LUXURIES.
I HATE CRUX.
I HATE COFFE.
I HATE LUXURIES.
I HATE BEING ABLE TO BE.
I HATE EXPECTATIONS.
I HATE DIFFICULT.
I HATE OVERTHROW.
I HATE OLDER.
I HATE DISCO.
I HATE EXPECTATIONS.
I HATE COLLECTIVES.
I HATE CULTURE.
I HATE FAMILY.
I HATE DANCING.
I HATE DISCO.
I HATE LIBERATE.
I HATE DISCO.
I HATE MIND.
I HATE BEING FAMOUS.
I HATE EXPERIMENT.
I HATE CRUX.
I HATE GARAGE.
I HATE PARTIES.
I HATE PRODUCTIONS.
I HATE PRODUCTIONS.
I HATE PRODUCTIONS.
I HATE GARAGE PARTIES.
I HATE FAMOUS.
I HATE EXPECTATIONS.
The rest of its output for this batch is equally unhinged (although I believe mistakenly capitalising the name is partly responsible for the general angry tone).
For anyone interested, I’ve made available for download a file containing 500 bulk-generated GPT-2 bios. I’ve also shared the fine-tuned model file and my Colab notebook so you can generate bios yourself. The model file is about 1GB, so be prepared for that. For the code to work, you need to put the file in your Google Drive and not rename it.
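If you use the notebook, pulling a checkpoint in from Drive and bulk-generating a batch like the one above looks roughly like this in gpt-2-simple; the run name and file names are placeholders, so adapt them to however the checkpoint is stored.

```python
import gpt_2_simple as gpt2

# Copy the fine-tuned checkpoint from Google Drive into the Colab runtime,
# then load it (assumes it was saved under this run name)
gpt2.mount_gdrive()
gpt2.copy_checkpoint_from_gdrive(run_name="dj_bios_355M")

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name="dj_bios_355M")

# Write a large batch of bios straight to a text file
gpt2.generate_to_file(
    sess,
    run_name="dj_bios_355M",
    destination_path="generated_bios.txt",
    nsamples=500,
    batch_size=20,
)
```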
The end
This was all just done for fun, but the potential of these models is clear. Keep in mind that the output was produced not by the best available model but by the ‘medium’ one, and with only a modest amount of fine-tuning.
Teams are already building similar models with a greater level of control, which could enable serious commercial applications. I do wonder what kind of effect this technology will have on some writing careers in the not so distant future.
I’ll probably return with more applications of this kind of stuff. In the meantime, take a look at OpenAI’s Jukebox for how neural nets are now being applied to music and songs, including this timeless classic.