Integrating User-Generated Content based on visual and textual components
People increasingly contribute data online, both actively and passively, in the form of social media posts, Wikipedia entries, search engine queries, or contributions to Citizen Science Projects (CSPs). This constant stream of User-Generated Content (UGC) has given rise to an entire field of scientific research that tries to leverage this data for applications such as biodiversity monitoring, land use estimation or mobility studies. As the amount of UGC and the number of people with internet access grow, the number of data sinks (e.g. social media platforms) also increases. For a researcher using UGC, this implies that when collecting data on a specific topic, it is ever more favorable to draw upon multiple UGC sources simultaneously, not only to collect more data but also to form a more representative sample. Compiling an integrated dataset comprised of multiple UGC sources on a specific topic is therefore crucial and addresses many of the common UGC limitations. Representativeness, i.e. the share of the population that is included in the data, can be enhanced when multiple data sources with their inherent demographic characteristics are considered. Besides broadening the included user base, the most obvious benefit of data integration lies in an increased data volume, which improves the overall temporal and spatial coverage.
This project addresses the need for an automated UGC data integration workflow that merges data from different data sinks (e.g. social media platforms). We focus on jointly leveraging the text and image components of UGC to extract posts relevant to a target topic from all platforms and build a merged dataset. Taking advantage of this dual modality in a structured fashion has been shown to improve classification tasks. A data integration workflow that accommodates the described functionality requires automated computer vision and text analysis methods.
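The core idea of combining both modalities can be sketched as a simple relevance filter: a post is kept if either its text matches topic keywords or an image classifier signals the topic. All names, fields and thresholds below are illustrative assumptions, not the project's actual implementation.

```python
# Minimal sketch of a joint text/image relevance filter.
# Post fields, keywords and the 0.5 threshold are hypothetical.
from dataclasses import dataclass

@dataclass
class Post:
    text: str
    image_score: float  # assumed image-classifier confidence in [0, 1]

def text_matches(text: str, keywords: set) -> bool:
    """Case-insensitive keyword match on the post's textual component."""
    lowered = text.lower()
    return any(kw in lowered for kw in keywords)

def is_relevant(post: Post, keywords: set, threshold: float = 0.5) -> bool:
    """Keep a post if either modality signals topic relevance."""
    return text_matches(post.text, keywords) or post.image_score >= threshold

posts = [
    Post("Saw a bird of prey over the hills today!", 0.2),  # text fires
    Post("Lovely afternoon walk", 0.9),                     # image fires
    Post("Lovely afternoon walk", 0.1),                     # neither fires
]
keywords = {"bird of prey", "birdwatching"}
relevant = [p for p in posts if is_relevant(p, keywords)]
print(len(relevant))  # → 2
```

In practice the OR-combination trades precision for recall; a weighted combination of both scores is an equally plausible design under the same structure.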
To demonstrate the functionality of our approach, we set out to compile an integrated dataset of bird sightings, specifically for the species Red Kite (Milvus milvus). Some adaptations to the workflow were needed to fulfill this task. First, a custom image classification model able to detect Red Kites enables the visual content analysis. Second, the common names of the Red Kite in six languages plus its Latin name were added as keywords to customise the textual content analysis. We used the Chilterns, an Area of Outstanding Natural Beauty (AONB) located north-west of London, as the research area to analyse the properties of the integrated dataset based on:
- spatial & temporal coverage
- data volume
- unique user base (demographics)
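The textual customisation described above amounts to a multilingual keyword match. The common names below are genuine vernacular names of Milvus milvus, but the choice of these particular six languages is an illustrative guess; the project's actual keyword list is not reproduced here.

```python
# Hypothetical multilingual keyword set for Red Kite (Milvus milvus).
# Six common names plus the Latin name, per the workflow description;
# the specific languages shown are an assumption.
RED_KITE_KEYWORDS = {
    "red kite",       # English
    "rotmilan",       # German
    "milan royal",    # French
    "milano real",    # Spanish
    "nibbio reale",   # Italian
    "rode wouw",      # Dutch
    "milvus milvus",  # Latin (scientific name)
}

def mentions_red_kite(text: str) -> bool:
    """Case-insensitive substring match against all keyword variants."""
    lowered = text.lower()
    return any(kw in lowered for kw in RED_KITE_KEYWORDS)

print(mentions_red_kite("Ein Rotmilan kreist über dem Feld"))  # → True
```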
During our studies, we work on the integration of three data sources: Flickr and two CSPs.
The integration of Red Kite observations from Flickr with the two CSPs in the Chilterns area yielded the following results:
- 12% increase in data volume
- increase in data availability for the years 2004 - 2015
- 20% increase in unique users
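The three reported gains can be computed by comparing the CSP-only baseline with the merged dataset. The record fields (`user`, `year`) and the toy data below are assumptions for illustration, not the study's data.

```python
# Sketch of how integration gains along the three properties
# (data volume, temporal coverage, unique users) could be measured.
def integration_gains(baseline: list, added: list) -> dict:
    merged = baseline + added
    base_users = {r["user"] for r in baseline}
    merged_users = {r["user"] for r in merged}
    return {
        "volume_increase_pct": 100 * (len(merged) - len(baseline)) / len(baseline),
        "user_increase_pct": 100 * (len(merged_users) - len(base_users)) / len(base_users),
        # years covered only by the newly added source
        "new_years": sorted({r["year"] for r in added} - {r["year"] for r in baseline}),
    }

# Toy example: two baseline records, one added record from a new source.
baseline = [{"user": "a", "year": 2016}, {"user": "b", "year": 2017}]
added = [{"user": "c", "year": 2010}]
result = integration_gains(baseline, added)
print(result)  # → {'volume_increase_pct': 50.0, 'user_increase_pct': 50.0, 'new_years': [2010]}
```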