Using Big Data to find patterns of climate conversations on the web by CC NOW

Description

Summary

The climate change corpus includes a complex array of narratives: it is an assortment of intellectual debates, conspiracy theories, psychological traits of denial and disbelief, geopolitics and scientific publications. People’s attitudes towards climate change depend on geographical location, structure of society, economic condition, and cultural identity. Therefore, shaping the collective perspective on climate change requires an understanding of human psychology.

Our proposal aims to study online activity on different social networking websites by collecting large amounts of data from people’s clicks, comments, ratings and likes/dislikes. These data reveal people’s perceptions of climate change and exhibit a wide spectrum of opinions. For example, the comment section of one of the YouTube trailers of “An Inconvenient Truth” (a documentary on climate change) shows the different ways people interpret data, graphs and stories on climate change. Many viewers expressed their opinions by engaging in intellectual debate, explaining the statistics used in the video and mentioning recent scientific discoveries, whereas others commented on politics or the quality of the video, or posted completely irrelevant material, including insults. Not only the content but also the language itself contains valuable information on people’s sentiments, optimism and frustration. Aggregating and analyzing the comments on climate change related YouTube videos can therefore help us understand people’s perspectives on climate change. By gathering more information from blogs and other social networking websites (e.g. Facebook, Twitter, LinkedIn and Google+) and analyzing this high-volume data, we can create web content that is persuasive and makes a lasting impression on people’s minds.

Category of the action

Changing public perceptions on climate change

What actions do you propose?

The idea for our project stemmed from casual conversations in which team members bounced around ideas on web scraping, a technique for collecting data from websites and online portals. With the increasing popularity of social networking and community sites, the internet has become a major source of behavioral data. Every day, enormous amounts of user-generated content are produced that encourage people to interact with each other. Analyzing this content and the corresponding web traffic can provide valuable insights into human behavior, both online and offline.

For example, one of our research interests is to study how people look for information online by analyzing the spatial distribution of climate change related search terms and their statistical association with similar keywords. Even simple, publicly available tools like Google Trends and Google Correlate can be extremely useful in this regard. A sample of interesting results is displayed below:

Fig: Interest in the search term "climate change" over time (2004-2014), source: Google Trends

Fig: Correlation between two search keywords in USA (Jan, 2004 - July, 2014)

Source: Google Correlate, monthly time series, 2014
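The statistical association between two search terms can be illustrated with a Pearson correlation coefficient computed over two monthly series. The sketch below is only an illustration of that idea, not a reproduction of the tool's internals; the function name `pearson` is hypothetical and the two short series are made-up placeholder values, not real search data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Placeholder monthly interest scores for two search terms (illustrative only).
term_a = [40, 55, 62, 48, 70, 66]
term_b = [35, 50, 60, 45, 72, 61]
r = pearson(term_a, term_b)
print(round(r, 3))
```

A value close to 1 would indicate that interest in the two terms rises and falls together; a value near 0 would indicate no linear association.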

Obviously, it is naïve to claim that a single comment or tweet will expose all the strings of a person’s psyche; the project does not intend to analyze individual behavior at a micro level. However, accumulating millions of pieces of web content (images, videos and text) and exploring how people interact with and respond to them will help us recognize patterns. These behavioral patterns will help us create more persuasive content, such as photo illustrations, cartoons and educational videos, to change people’s attitudes towards climate change.

Our objectives are to:

(a) Collect a large volume of data from blogs and social networks

(b) Create web content that actually alters individual decisions (consumer habits, lifestyle) and industry practices (focus on sustainable production of goods and services)

We will use statistical regressions to estimate coefficients for different influential factors (viral videos, internet memes, etc.), and these estimates will guide the creation of our own content. We can then examine the resulting changes in people’s behavior using Google Analytics.
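As a rough illustration of what estimating such a coefficient involves, the sketch below fits an ordinary least squares line with a single predictor. It is a minimal example under stated assumptions: the function name `fit_line`, the choice of one predictor, and the numbers are all hypothetical placeholders, and a real analysis would likely involve several factors and a statistics library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("the predictor must vary to estimate a slope")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Placeholder data (illustrative only): x = number of memes embedded in a
# post, y = number of shares the post received.
memes = [0, 1, 2, 3, 4]
shares = [12, 15, 21, 24, 30]
intercept, slope = fit_line(memes, shares)
print(intercept, slope)
```

The slope is the estimated coefficient: the average change in the outcome associated with a one-unit change in the factor.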

This approach is fundamentally different from asking people directly about their perception of climate change. Several surveys have studied people’s perspectives on scientific issues and their engagement in environmental issues, but survey responses often do not reflect true behavior or capture the degree of trade-off. Therefore, a revealed preference approach, in which observed behavior is analyzed to derive behavioral patterns, would be more effective.

Our workflow is divided into three phases -

Phase I: Planning, programming and prototyping

I(a) Collecting Behavioral Data from Web:

The virtual world of communication is rich with real-time interactions. Climate change news hosted by online media lets readers comment on it in real time. Similarly, a video on YouTube gives viewers a platform for conversation. Besides YouTube and news portals, there are personal blogs and official websites of different organizations that publish a huge amount of web content on climate change and other environmental issues.

We are interested in capturing this information and sorting it according to our desired variables. Most news portals and social networking sites allow data collection through their own APIs.

I(b) Pilot study

While writing our proposal, we decided to conduct a small-scale trial to refine our concepts. The purpose was to try different web scraping queries and to familiarize ourselves with different APIs and natural language processing techniques.

Type of data: English-language text (in the future we will modify the code to handle other languages in Unicode)

Sample selection: We used the TED talk channel on YouTube for our variable group. On the http://www.youtube.com/user/TEDtalksDirector page, we searched for videos using the keyword “climate change”. After ordering the search results by view count, we selected the top 10 videos and analyzed their comment sections.

To control for external factors (e.g. popularity of the uploader), we applied the same procedure to (a) the top 10 non-TED YouTube videos and (b) the top 10 non-TED YouTube videos for the search term "global warming".

Data: We extracted the following information for each video: title, view count, date of publication, tags, likes and dislikes, total number of comments, number of replies to each comment, the actual text of each comment, etc.

Methodology and Analysis

Sentiment Analysis and Opinion Mining

The sentiment analysis code was developed to determine the degree of belief in climate change expressed in each comment (i.e. positive, negative or neutral). A list of all the words in the text was then created, and these words were ordered by the frequency with which they appear using the ‘FreqDist’ function of NLTK. Three files containing positive, negative and neutral comments were prepared as the training dataset. Tagging was done with the zip function, which pairs the items of two lists into tuples, so that each text was tagged with its respective sentiment (positive, negative or neutral). Finally, a feature extractor was created that extracts relevant information from the positive, negative and neutral sentences and passes it to the classifier for training. The feature extractor returns a dictionary that tells us which words of its input are also found in our custom word list; in this way the words in the original text are compared with the words in the word list. The classifier is then trained on the resulting training set. Once trained, it attempts to read the sentiment of an input according to the positive, negative or neutral associations of the words it learned from the training set.

The script produces an output file in which the comments are categorized by sentiment. For example, a comment like “There is substantial solid science and evidence behind Global warming and Anthropomorphic climate change” was classified as positive, whereas “MAN-MADE Global Warming or Climate change is an EVEN BIGGER LIE” was classified as negative.
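The pipeline described above (word frequency list, tagged training comments, word-presence feature extractor, trained classifier) can be sketched roughly as follows. This is not the team's actual script: to stay dependency-free it uses `collections.Counter` in place of NLTK's `FreqDist` and a small hand-written naive Bayes model, which is an assumption since the proposal does not name the classifier. All function names are hypothetical, and apart from the two comments quoted above, the training comments are invented placeholders.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase a comment and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_word_list(labeled_comments, size=50):
    """Return up to `size` most frequent words across the training comments."""
    counts = Counter(w for text, _ in labeled_comments for w in tokenize(text))
    return [word for word, _ in counts.most_common(size)]

def extract_features(text, word_list):
    """Dictionary telling which word-list entries occur in the text."""
    words = set(tokenize(text))
    return {word: (word in words) for word in word_list}

def train(labeled_comments, word_list):
    """Count, per sentiment label, how many comments contain each word."""
    label_counts = Counter(label for _, label in labeled_comments)
    present = defaultdict(Counter)
    for text, label in labeled_comments:
        for word, found in extract_features(text, word_list).items():
            if found:
                present[label][word] += 1
    return label_counts, present

def classify(text, word_list, model):
    """Pick the label with the highest naive Bayes log-probability."""
    label_counts, present = model
    total = sum(label_counts.values())
    features = extract_features(text, word_list)
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        for word, found in features.items():
            p = (present[label][word] + 1) / (n + 2)  # add-one smoothing
            score += math.log(p if found else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Texts and sentiments kept in two lists, then paired with zip.
texts = [
    "There is substantial solid science and evidence behind Global warming "
    "and Anthropomorphic climate change",
    "The evidence is solid and the science is clear",         # placeholder
    "MAN-MADE Global Warming or Climate change is an EVEN BIGGER LIE",
    "This is a lie and a hoax",                               # placeholder
    "Nice video quality and good sound",                      # placeholder
    "The sound quality of this video is poor",                # placeholder
]
sentiments = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
training = list(zip(texts, sentiments))

word_list = build_word_list(training)
model = train(training, word_list)
print(classify("The science and evidence are solid", word_list, model))
print(classify("What a big lie", word_list, model))
```

With such a tiny training set the output is only indicative; a usable classifier would need many labeled comments per sentiment.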

Fig: Tag cloud of the comment section (top 10 TED videos for the search term "climate change")

Phase II: Paraphrasing

II (a) Scaling up the data collection process

In this phase, the data collection process will be enhanced to collect a larger volume of information. The PHP scripts will be automated to scan for climate change related texts and extract them in a pre-defined format (XML, JSON). We will collect data from Twitter, Facebook, LinkedIn and Google+. Scaling up the data collection process will improve the accuracy of inferential analysis. We will use Python and PHP scripts to extract the data and store it on a cloud server.
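A minimal sketch of the scanning and extraction step might look like the following. It is written in Python for illustration and is not the team's script: the keyword list, the `source` and `text` field names, and the sample items are assumptions made for the example.

```python
import json
import re

# Assumed search terms; both appear as search keywords in the pilot study.
KEYWORDS = ("climate change", "global warming")
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

def extract_matching(items):
    """Keep items whose text mentions a keyword; return them as a JSON string."""
    matched = [
        {"source": item["source"], "text": item["text"]}
        for item in items
        if PATTERN.search(item["text"])
    ]
    return json.dumps(matched, ensure_ascii=False, indent=2)

# Placeholder items standing in for collected posts (illustrative only).
sample_items = [
    {"source": "twitter", "text": "Global Warming is on the agenda again"},
    {"source": "facebook", "text": "Photos from our weekend trip"},
    {"source": "blog", "text": "How climate change affects coastal towns"},
]
result = extract_matching(sample_items)
print(result)
```

Simple keyword matching will miss paraphrases and catch off-topic mentions, so it is best seen as a first filter before further analysis.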

II (b) Creating common resource pool

The knowledge base from phases I and II will be shared with government organizations as well as universities and think tanks. We will present the trends we find in the initial data through traditional press releases and social networking sites. Similarly, we will develop a data sharing process through which the actual dataset will be shared. For privacy reasons, the data will be partly encrypted, and personal IDs and geocoded information will be blinded to make sure the dataset complies with all API terms.
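One possible way to blind personal IDs and geocoded information before sharing is sketched below: user IDs are replaced by a keyed hash and location fields are dropped. This is an assumption about how the blinding could be done, not the team's stated method; the field names (`user_id`, `latitude`, `longitude`, `location`) and the key are hypothetical.

```python
import hashlib
import hmac

# Assumed names of geocoded fields to strip before sharing.
SENSITIVE_FIELDS = ("latitude", "longitude", "location")

def pseudonymize(record, secret_key):
    """Return a copy of `record` with the user id replaced by a keyed
    SHA-256 hash and geocoded fields removed."""
    cleaned = {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}
    digest = hmac.new(secret_key, str(record["user_id"]).encode("utf-8"),
                      hashlib.sha256)
    cleaned["user_id"] = digest.hexdigest()
    return cleaned

# Placeholder record and key (illustrative only).
key = b"replace-with-a-private-key"
record = {"user_id": "u123", "text": "Great talk!",
          "latitude": 53.5, "longitude": -113.5}
shared = pseudonymize(record, key)
print(shared)
```

Because the same ID always maps to the same hash under a given key, comments by one user can still be grouped without revealing who the user is, provided the key itself is kept private.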

II (c) Pattern recognition and predictive analysis

We will use Big Data to explore any visible trends in the way people interact with different content on different platforms. Some of the areas of interest are as follows:

Is there a systematic difference (and if so, to what extent) in the way people comment on LinkedIn, Facebook and Google+? What factors influence this difference?

What types of hashtags on Twitter or Instagram address climate change? How many replies, shares and likes do these tweets or images receive?

What types of positive and negative sentiment are associated with different terms used in the comment sections? When people argue, what types of references and examples do they use to support their opinions?
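As an example of how the hashtag question could be approached, the sketch below totals replies, shares and likes per hashtag across a set of posts. The post structure (`text`, `replies`, `shares`, `likes`), the function name and the sample posts are hypothetical placeholders; real field names would depend on each platform's API.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")

def hashtag_engagement(posts):
    """Sum replies, shares and likes per hashtag across a list of posts."""
    totals = Counter()
    for post in posts:
        engagement = post["replies"] + post["shares"] + post["likes"]
        # Lowercase and de-duplicate so a tag counts once per post.
        for tag in set(HASHTAG.findall(post["text"].lower())):
            totals[tag] += engagement
    return totals

# Placeholder posts (illustrative only).
posts = [
    {"text": "Act now! #ClimateChange #climatechange",
     "replies": 2, "shares": 5, "likes": 10},
    {"text": "New report out #climatechange #energy",
     "replies": 1, "shares": 0, "likes": 4},
    {"text": "No tags here", "replies": 1, "shares": 1, "likes": 1},
]
totals = hashtag_engagement(posts)
print(totals.most_common())
```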

Gallery: A collection of popular web contents on climate change

Fig: A viral YouTube video (3 million+ views) on climate change (source)

Fig: Screenshot of the top 10 popular hashtags related to climate change on Twitter (source)

Fig: Selected memes, illustrations (source: 1, 2, 3, 4)

These types of viral videos, anchoring texts (hashtags) and internet memes attracted many online users, who added valuable information and expressed frank, personal opinions. Studying this content can reveal its "appeal" and "influence" factors.

Phase III: Partnering with government and perfecting the art of persuasion

At this stage, we will share our common resource pool not only with academic institutions but also with government agencies. Government organizations typically benefit from “economies of scale”, and this will enable us to link our dataset with other demographic datasets such as census or labor force surveys.

III (a) Experimenting with web contents

In this phase, we will have more insight into behavioral trends, which will help us create our own web content and test its ability to engage people in meaningful discussion. This heuristic process involves presenting the same piece of information in different ways and testing the effectiveness of each.

Let’s consider an example: ranking countries by annual carbon dioxide emissions per capita. We can present the information as a table, a bar diagram, a map (color-coded by volume of emissions), a video montage, or in many other ways. How do we know which one is more effective at engaging people? Commonly used indicators are (a) the average time people spend on the webpage, (b) the number of comments, likes/votes and shares, and (c) clicks and scrolls on the body of the text. We can use these indicators to measure the potency and popularity of climate change related texts, images and videos.
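Comparing two presentations of the same information on one of these indicators can be framed as a simple A/B test. The sketch below applies a two-proportion z-test to the share rate of two variants; the choice of test, the function name and the visitor counts are assumptions made for illustration, not results from the project.

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("test is undefined when no or all visitors convert")
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value

# Placeholder counts (illustrative only): visitors who shared the page,
# out of all visitors shown the map (A) or the table (B).
z, p_value = two_proportion_z(120, 1000, 80, 1000)
print(round(z, 2), round(p_value, 4))
```

A small p-value would suggest that the difference in share rates between the two variants is unlikely to be due to chance alone; indicators such as time on page would call for a different test.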

Who will take these actions?

Phase I: Independent researchers

In the initial phase, the pilot study was conducted by the following "CC NOW" team members -

MD. Moman-Ul Haque Khan (Software Engineer)
Nafis Hasan (Graduate Student, Tufts University)
Mohib-Ul-Haque Khan (Graduate Student, University of Alberta)
Fahim Hassan (Research Analyst, University of Alberta)

In successive phases, we will collaborate with researchers from different disciplines (computer programming, linguistics, statistics and psychology) by sharing our database and initiating joint projects.

Phase II: Universities, non-profit research organizations

There will be partnership projects among different study groups/labs, academic institutions and non-government organizations to enhance knowledge sharing and public engagement. We strongly believe in leveraging the expertise of existing study and support groups and do not recommend creating yet another group or organization. Rather, we aim to enrich the ongoing conversation, amplify the message and broadcast it to the right ears. For example, instead of opening a new student club at a university, we will connect with an existing student club and jointly organize events. We will build partnerships with such clubs and organize mock parliamentary debates on environmental issues. Similarly, we will help local movie clubs show documentaries on nature and collaborate with student photography clubs to organize photo exhibitions on the impact of climate change. This way, “climate change” is not an additional issue to work on; it becomes embedded in all socio-cultural activities.

Phase III: Government organizations, social co-operatives

The scope of the collaborative projects will be extended to government organizations. In most countries the central government already has a cell, usually nested inside the Ministry of Environment, that addresses climate change issues. Sharing our findings with these government bodies will advance our knowledge sharing objective.

Where will these actions be taken?

The workflow is not tied to a particular geographical region. For phases I and II, anyone from anywhere can join as long as there is an internet connection for effective communication. During our pilot study, the core team connected with researchers from 10 different countries on 4 continents. Team-building and training will be done mostly through conference calls and webinars.

For phase III, there will be a series of area-specific events. Adaptation activities will be prioritized on the basis of vulnerability, which means that coastal areas and islands that are more exposed to risk will be weighted more heavily. We will connect with local governments and educational institutions to tailor our content to those areas. For example, the southern regions of Bangladesh are highly vulnerable; even a minor rise in sea level can flood the riparian habitats. Regions like this will be given higher priority.

How much will emissions be reduced or sequestered vs. business as usual levels?

Translation of online activism into actual actions is hard to measure, but there is a plethora of literature showing how internet activism can lead to actual social and political change. Additionally, social and political movements in recent years have shown the necessity of a strong online voice for creating positive change in popular opinion and governmental policies. Although it will be extremely difficult to measure the emission reductions that come directly from this particular project, we expect that research and activism, combined, will create a positive web experience for climate enthusiasts and researchers.

In later stages of our project, we might be able to create an index that measures online trends from self-tracking apps (there are several apps that track individual carbon footprints) and the percentage of users’ inspiration that comes from online content. This index could be used as a proxy to measure the impact of our project.

What are other key benefits?

Data driven strategy
One of the advantages of this proposal is that it relies heavily on observational data instead of intuition or assumptions. Survey techniques are susceptible to sampling and response biases. By using Big Data, we hope to reduce such biases, since observed behavior does not depend on self-reporting and the sample is much larger.

Collaborative approach

The proposal is inclusive in nature and opens the platform for researchers from other disciplines to contribute. We look forward to sharing the database with research organizations. We will also share our code through GitHub.

Diverse application
Techniques like natural language processing, web analytics and A/B testing have multiple applications. Other proposals can also benefit from using these tools.

Low cost
The project does not require a lot of financial resources (the software used for data extraction and analysis was free and open source).

Pragmatism
From the beginning, we have focused on prototyping and have already conducted a small-scale trial to ensure feasibility.

What are the proposal’s costs?

The team believes in the lean startup approach, and one of the building blocks of this proposal is resource sharing. The proposal relies heavily on using open source software and sharing desks and community meet-up spaces. An estimated budget* for Phases I and II is given below:

Phase I:

Type of cost: Mostly fixed cost

Estimated total cost: 3,000 USD (approx.)

Composition:

Cost of online storage: $500
Print and publication: $1,000
Communication (Phone, Skype, Conference call): $500
Books and study materials: $1,000
Salaries/payments: $0 (Thanks to all the volunteers!)

Phase II:

Type of cost: Operational cost

Estimated total cost: 35,000 USD (approx.)

Composition:

Travel cost: Cost of attending conferences and workshops (registration, airfare and accommodation) - $10,000
Knowledge transfer: Production of infographics and posters, printing, journal subscriptions and miscellaneous community engagement costs - $3,000
Big Data infrastructure: Online storage (most probably Hadoop) - $5,000
IT maintenance: A dedicated team to maintain websites, blogs and databases - $5,000
Salaries/Payments: We will mostly hire freelancers. The typical positions we will be looking for are graphic designers, programmers, and media and communication staff - $10,000
Marketing and promotions: Google AdWords, Facebook ads - $2,000

Phase III

Yet to be determined

* All numbers are estimated based on current price quotations in July, 2014

Timeline

The timeline of the actions depends on various factors. Establishing the Big Data infrastructure has become comparatively easy in recent times. One of the challenging tasks will be to create a financially sustainable knowledge platform that supports researchers in continuing the data collection and analysis and in experimenting with creative content.

Short term:

Phase I: around 1-2 years

Phase II: around 3-5 years

Long term:

Phase III: 20-30 years

Related proposals

Youth Action for Climate Change - as the majority of the youth population interacts online, the project can benefit from referencing our research and using the persuasive content to inspire and educate its target demographic.

LiveSMART - this proposal can use our research results to make their apps better and to generate new ideas for stories.

References

Rayport, Jeffrey F. "Use Big Data to Predict Your Customers' Behaviors." Harvard Business Review. Harvard Business Review, Sept. 2012. Web. 10 July 2014.

Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. Beijing: O'Reilly, 2009. Print.

McCaughey, Martha, and Michael D. Ayers. Cyberactivism: Online Activism in Theory and Practice. New York: Routledge, 2003. Print.

McCaughey, Martha, and Michael D. Ayers, eds. "Classifying Forms of Online Activism: The Case of Cyberprotests Against the World Bank." Cyberactivism: Online Activism in Theory and Practice. New York: Routledge, 2003. 72-73. Print.

Contest: Shifting behavior for a changing climate 2014
How can we shift perceptions, values, norms, and attitudes to inspire individuals and institutions to take action on climate change?