Understand: Do your pseudo-code and comments show evidence that you recall and understand technical concepts?
Evaluate: Are you able to interpret the results and justify your interpretation based on the observed data?
By the end of this notebook, you will be expected to:
- Understand applications of big data and social science; and
- Identify potential pitfalls in applications of big data.
- Exercise 1: Interpret the risks of using and interpreting results from social data analytics.
- Exercise 2: Review the design and results of the Google Flu Trends (GFT) program.
- Exercise 3: Analyze the use of new financial risk models.
In this notebook, you will explore additional case studies and papers that deal with the application of social analytics in practice.
The collection of multi-modal and diverse ranges of signals allows us to better understand network modalities and surface patterns that are difficult to observe or extract from noisy or dirty data. Populations, and the patterns they exhibit, are complex by definition, and there is no one-size-fits-all method for approaching different studies. Typically, you would start your study with physical sensors, before extending your data set with social (and other) data to better understand the community or group being studied. In many cases, it is difficult to justify initiating large-scale data collection, but data generated for other purposes can be reused to build and extend analyses. Data from the telecommunications industry is a rich source of insight into human behavior, and has been found to be useful for a wide range of topics across multiple fields. This notebook briefly looks at some of these use cases. Using data generated as the output of other processes introduces risks (such as changes in the processes that generate the data), but offers the ability to perform analyses that were previously impossible.
Note:
The application of many of the concepts in this course is non-trivial, and specific to the use case at hand. Refer to an article by Sekara et al. (2016), which describes a theoretical framework to better understand the fundamental structures of dynamic social networks.
Human mobility offers insights into the spreading of disease as well as the ability to identify potential points of origin of diseases. You can read more about two such studies (one already introduced in the video content) below:
Known data sets can also be used to redefine the concept of distance: the number (weight) of people traveling between cities can be used to create an effective distance between locations. This value can be used in epidemiological studies not only to predict the spread of contagions, but also to attempt to identify points of origin, based on the same effective distance from all of the locations where the contagion is observed (Brockmann and Helbing 2013).
In the video resource, you were provided with a link to an interactive tool, created by Dirk Brockmann, which displays a view of the world based on effective distances. This tool is based on the article titled “The Hidden Geometry of Complex, Network-Driven Contagion Phenomena” (Brockmann and Helbing 2013), and has been included here for students wishing to further explore the concept of effective distance.
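The idea of effective distance described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the edge length commonly attributed to Brockmann and Helbing (2013), d = 1 - ln(P), where P is the fraction of travelers leaving one location who arrive at another, and it takes the effective distance between two locations to be the length of the shortest path under that edge length. The function name `effective_distances` and the traveler counts in `flows` are hypothetical and exist only for this example.

```python
import heapq
import math


def effective_distances(flows, origin):
    """Return the shortest-path effective distance from origin to each reachable node.

    flows[n][m] is the number of travelers going from n to m (hypothetical data).
    Assumed edge length: d = 1 - ln(P), where P = flows[n][m] / total outflow of n.
    """
    dist = {origin: 0.0}
    heap = [(0.0, origin)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, math.inf):
            continue  # stale heap entry; a shorter path was already found
        out = flows.get(node, {})
        total = sum(out.values())
        for neighbor, travelers in out.items():
            if travelers <= 0:
                continue  # no traffic means no usable connection
            length = 1.0 - math.log(travelers / total)
            candidate = d + length
            if candidate < dist.get(neighbor, math.inf):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Hypothetical traveler counts between three cities.
flows = {
    "A": {"B": 900, "C": 100},
    "B": {"A": 500, "C": 500},
    "C": {"A": 100, "B": 900},
}

distances_from_a = effective_distances(flows, "A")
for city, value in sorted(distances_from_a.items()):
    print(city, round(value, 3))
```

In this toy network, city C is effectively closer to A through the heavily traveled hub B than through the weak direct link, which is the kind of effect that geographic distance alone would hide.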
It is important to ensure that your data set is representative of the observations made, and is relevant to the questions being analyzed. The “Friends and Family” study uses multi-modal data sets. You have also been supplied with additional examples of data sets utilized in other studies. Physical data, sensor data, or survey data can be difficult or expensive to obtain. Although social network data may be easier to obtain, it carries inherent risks. Answer the following two questions based on a hypothetical healthcare case, using data from social network sources as the main input.
a) Are all individuals with high centrality in social networks good candidates for vaccination in a healthcare intervention?
b) What is the most important physical aspect that needs to be accounted for or validated?
Your markdown answer here.
Exercise complete:
This is a good time to "Save and Checkpoint".
Review the article titled “The Parable of Google Flu: Traps in Big Data Analysis” by Lazer et al. (2014).
a) What was the underlying premise of Google Flu Trends (GFT), a program designed by Google for real-time monitoring of flu cases around the world?
b) What went wrong with GFT between 2011 and 2013?
c) Considering that GFT is based on searches of words related to flu, explain what this means when using big data to explain reality.
Hint:
Keep Professor Pentland’s comments on correlation and prediction from earlier modules in mind.
d) How can the GFT model be improved?
Additional resources on the GFT program
- "Detecting influenza epidemics using search engine query data", Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski and Larry Brilliant (2009), Nature, vol. 457, 1012-1014. Preprint available here.
- In Defense of Google Flu Trends, Alexis C. Madrigal, The Atlantic, March 2014.
- Google Flu Trends’ Failure Shows Good Data > Big Data, Kaiser Fung, HBR, March 2014.
Your markdown answer here.
Exercise complete:
This is a good time to "Save and Checkpoint".
In his explanation of how big data may be applied in Human Resources (HR), Arek Stopczynski discusses sociometric badges, face-to-face contact, and privacy considerations.
Big data studies that have appropriate interventions can play a significant role in ensuring a more effective, productive, and happy workforce. These studies are often highly sensitive to privacy considerations due to the short time frame between observation and potential impact.
In the video content, Arek Stopczynski also asks whether or not you care (or should care) if a random company knows where you are at any given point in time (for example, a free application on your cellphone requiring location data), and whether your feelings would change if that company was the company that you work for. Many applications where you share your data can also be used to gain insights into aspects of your life that you may not be comfortable sharing with your employer. Some of these insights can be taken out of context, resulting in privacy violations and subsequent loss of trust. Consider Pokémon Go as an application that can potentially be used to gain insights about you. The Council for Big Data, Ethics, and Society also published a comprehensive white paper outlining a need for ethical conduct in data science and related endeavors (Metcalf et al. 2016).
In a recent talk, Gartner analyst Frank Buytendijk stated that trust is an emotion. He said, “It is the confidence people have in your future behavior” (Gartner 2015). He then went on to say that “algorithms allow companies to trust people to the exact level they deserve: dynamically and at scale. Trust goes two ways. People must trust your business too, to connect to you. It is the confidence people have in your future behavior” (Gartner 2015).
You can read more about the topic on Page 11 of Gartner’s “Rising to the Challenge of Digital Business” report.
The five key dynamics to trust are:
Those who are interested in exploring additional material on trust can review the following two articles:
Once you are in a trusted relationship with employees, you can start to analyze interesting questions such as “do teams have intelligence?” This question has been addressed by Anita Woolley et al. (2010), and was covered by Arek Stopczynski in the video content.
Understanding trust, and the interactions in networks, allows you to make optimal use of the available technologies, structures, and teams, and to build more effective and productive virtual teams. Before considering the potential interventions that you can apply in your particular industry, review the article titled “Getting Virtual Teams Right” for more information on this topic.
In the financial industry, the traditional use of data focused on age, gender, job type, and marital status. This use ignored spatio-temporal data and the habits of individuals. In the video content, Xiaowen Dong discusses a number of applications of big data in the financial and marketing industries. Many of these topics are relevant beyond those industries, because finance and marketing are core activities in most organizations.
Note: You can read more about the use of financial big data analysis for the estimation of systemic risks in the University of Pavia’s DEM Working Paper Series.
A number of initiatives and companies have found ways of using available data sources to build alternative models for assessing creditworthiness that are not based on credit bureau data. These range from government initiatives to products created in developing countries to serve communities that previously did not have access to these services (for example, “promoting financial inclusion”, “filling the gaps between financial access and financial inclusion”, and “Big data, small credit”). You can read more about how to use call data to tap into the low-income customer base through mobile phones.
Review the article titled “Using new models and big data to better understand financial risk”, by Jennifer Donovan (2016).
a) What fundamental observation regarding financial outcomes allows you to gain insights and build products for individuals based on mobile data?
Note: Your answer can be a single statement from the referenced article or you can explain the observation in your own words, based on the content presented in this course.
b) Traditional credit assessments typically build models across three dimensions of an individual: affordability, stability, and willingness (to pay back an extended loan). How can these three characteristics be defined based on mobile phone metadata? In your answer, consider what you have learned regarding social network analysis and predictive modeling using Bandicoot features.
c) Which other sources of data (excluding credit bureau data) can be considered to improve these alternative credit-worthiness assessment models, based only on mobile behavior patterns?
d) What major risk(s) can you think of when credit assessment models are built using only mobile metadata?
Hint:
Refer to the earlier discussion on GFT.
Your markdown answer here.
Exercise complete:
This is a good time to "Save and Checkpoint".
Brockmann, Dirk, and Dirk Helbing. 2013. “The Hidden Geometry of Complex, Network-Driven Contagion Phenomena.” Science 342:1337-1342. doi:10.1126/science.1245200.
Gartner. 2015. “Gartner Says the Economics of Connections in Digital Business are Accelerating by the Use of Algorithms.” Last modified October 5. http://www.gartner.com/newsroom/id/3142918.
Metcalf, Jacob, Emily F. Keller, and Danah Boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. Last Accessed January 24, 2017. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/.