Modern digital marketing is a competitive field in which marketers need to take advantage of any opportunity that might give them an edge over the competition. Scaling content production has become a major challenge for marketers, since a company's online positioning is just as important as its offline presence, if not more so. Managing multiple social media accounts and websites can be time-consuming and tedious, while the influence of social media on sales grows every day. Using machine learning and natural language processing (NLP), we have developed a system capable of producing human-like social media captions in seconds.
Our client is a digital and social media marketing expert that provides its customers with tools for improving their marketing efforts. Scaling social media marketing is one of the most important tasks in any digital marketing campaign, since post frequency and level of engagement are two of the key factors in a campaign's success on any social media platform. Automatic caption generation addresses both: the texts are generated within seconds, and they are written to maximise engagement.
We were tasked with developing a system that takes an image as input and provides a social media caption as output. The text had to not only match the content of the image, but also be indistinguishable from text written by a human. The ‘human-like’ aspect of the generated texts was of the utmost importance, because users can easily identify AI-generated text, which reduces engagement and customer trust.
The process of generating a text based on an image consists of two stages: image recognition, followed by text generation.
Our previous experience with image recognition in a number of projects helped us to quickly develop the app’s architecture:
The first step in generating a social media caption is to extract rich information from the image. For text generation purposes, the app needed to detect the objects present in the image and classify them. We used Azure Computer Vision services to process the images, as an out-of-the-box solution is both cost-efficient for the client and sufficient in quality for caption generation. The service takes an image as input and generates JSON output listing the recognised objects, each with a degree of confidence. If the confidence is high enough, we consider the described object to be present in the image. The list of “high confidence” objects is then passed down to the text generation module.
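The confidence filtering described above can be illustrated with a minimal Python sketch. It is not the production code: the threshold value, the function name `high_confidence_objects` and the sample data are hypothetical, and the JSON field names (`objects`, `object`, `confidence`) are an assumption about the shape of the service's object detection output.

```python
import json

# Hypothetical cut-off; the value actually used in the project is not stated.
CONFIDENCE_THRESHOLD = 0.6


def high_confidence_objects(response_json: str,
                            threshold: float = CONFIDENCE_THRESHOLD) -> list[str]:
    """Return the labels of detected objects whose confidence meets the threshold."""
    response = json.loads(response_json)
    labels: list[str] = []
    for detection in response.get("objects", []):
        # Assumed field names: "object" (label) and "confidence" (score in 0..1).
        label = detection.get("object")
        confidence = detection.get("confidence", 0.0)
        if label is None or confidence < threshold:
            continue
        if label not in labels:  # keep each label once, in first-seen order
            labels.append(label)
    return labels


# Made-up sample standing in for the image recognition output.
sample = json.dumps({
    "objects": [
        {"object": "dog", "confidence": 0.92},
        {"object": "ball", "confidence": 0.35},
        {"object": "beach", "confidence": 0.81},
        {"object": "dog", "confidence": 0.77},
    ]
})

print(high_confidence_objects(sample))  # prints ['dog', 'beach']
```

Only the surviving labels would be handed to the text generation module. Deduplicating them here is a design choice of this sketch, made so that a repeated detection does not weigh on the caption twice.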
The main goal of this module is to generate a caption that is indistinguishable from one written by a human. To produce high-quality texts that mimic social media posts written by real users, we developed a machine learning model and trained it on a large dataset of captioned images provided by our client.
The finalised model was then used to generate captions based on a list of objects passed down from the image recognition module. The caption generation then goes as follows:
The model we developed generates social media captions that are indistinguishable in tone and grammatical structure from those written by real users. It embellishes the generated text with appropriate emojis and hashtags to further imitate what an average social media user would write under an image.
In fact, the model even makes the same mistakes as an average social media user, such as misspelling a word or using the wrong one. The dataset used to train the model included captions containing mistakes, which the model replicates from time to time. This is a testament to how closely it reproduces real captions.