By the end of this notebook, you will be expected to:
- Understand project life cycle management in big data projects;
- Categorize data in social physics;
- Understand the interplay between social analytics and technology; and
- Develop an awareness of the challenges in scaling analytical projects, or integrating these within large organizations.
- Exercise 1: List advantages of multimodal approaches.
- Exercise 2: Identify various data categories and the information they reveal about individual behavior.
- Exercise 3: Identify a social analytics use case based on recent technological innovations.
- Exercise 4: List problems that may arise in managing large analytical projects, and suggest corrective actions.
While you have been introduced to many technical concepts, tools, and technologies, little attention has been paid to people and processes. These are critical elements to consider when undertaking projects using data. Technological advances mean that tools and data collection methods are becoming more accessible, and that projects that previously required significant investment (in terms of human resources and technology) can be undertaken by a wider range of audiences. Your use cases and reasons for undertaking projects will vary, but you will still need to interact with a variety of stakeholders.
This notebook introduces a big data project methodology that will allow you to set up big data projects in your own social or commercial context. The existing project is used to demonstrate some of the key concepts that you can use as input in setting up your own projects in the future.
In the video content, Professor Pentland referred to specific concepts that can also be applied to your project. Some of these include the following:
Note:
While the content introduced in this course aims to provide you with insights typically applied in social analytics projects, you can also use these insights to ensure the successful completion of your project. For example, multiple parties working on similar projects with a shared vision typically have a higher likelihood of success than individuals working in isolation. From a systems point of view, this can be attributed to items such as shared tasks (data collected once and used multiple times). However, people often underestimate the impact of social context when executing projects. Keep all of the content introduced in the course in mind while working through this notebook, and ensure that you apply these concepts when setting up or interacting with other parties in your future endeavors.
Refer to the abstract of the paper titled “Social fMRI: Investigating and shaping social mechanisms in the real world”, provided below. You are encouraged to work through the detail of the paper over the next two weeks.
We introduce the Friends and Family study, a longitudinal living laboratory in a residential community. In this study, we employ a ubiquitous computing approach, Social Functional Mechanism-design and Relationship Imaging, or Social fMRI, that combines extremely rich data collection with the ability to conduct targeted experimental interventions with study populations. We present our mobile-phone-based social and behavioral sensing system, deployed in the wild for over 15 months. Finally, we present three investigations performed during the study, looking into the connection between individuals’ social behavior and their financial status, network effects in decision making, and a novel intervention aimed at increasing physical activity in the subject population. Results demonstrate the value of social factors for choice, motivation, and adherence, and enable quantifying the contribution of different incentive mechanisms.
(Aharony et al. 2011)
Taking a high-level "People, Process, Technology, Data" view of projects can be useful when planning or communicating projects at a conceptual level. Most project management methodologies will contain similar elements to those depicted in the following image (SAS Institute 2012).
Reuse existing skills, and budget time and training for your human resources. Many aspects of big data projects involve learning curves and require additional training. The analyses performed are iterative by nature, and many of the steps involved are extremely difficult to budget for using traditional project management methodologies. Ensure that you have the relevant skills available, and that you deal with the introduction of new tools and methodologies appropriately. Make sure that the analysts work in an environment where other members of the organization can support them, or that they work with tools that have a rich community of support (social context).
Typical roles include the following:
The “process” component of projects refers to the following aspects:
Academics have a well-defined and rigorous process in place to deal with many of the issues that may arise. In the business and commercial world, many organizations are moving from waterfall-based approaches to more agile, "fail-fast", or "data-driven" approaches.
The “technology” component of projects is characterized by the need to:
Be careful of moving to this step too quickly. Resolve the function and architecture before making technology choices.
The “data” component of projects involves carrying out the following three steps:
Approaching projects purely from a technical or analytical point of view often delivers underwhelming results. Be aware that once you reach an insight, you will also have to act on that insight, and that action needs to be performed in the social context of your organization or environment. This may be in the form of a simple change in an existing business process. However, in many cases, it may be necessary to make significant changes to existing processes, or to introduce new processes, products, or services, all of which require some form of change management. Review topics around organizational transformation and “data-driven approaches” for further information and guidance. Many of these approaches advocate iterative or “fail-fast” methods in order to test the concept before embarking on full-scale projects. The approach you follow will be highly dependent on the type and size of the organization, and potentially on the tools and resources at your disposal. The majority of approaches contain similar elements, which will be explored in this notebook, and are listed below:
In their research paper, Aharony et al. (2011) include the following paragraph to describe their vision:
Imagine the ability to place an imaging chamber around an entire community. Imagine the ability to record and display nearly every facet and dimension of behavior, communication, and social interaction among the members of the said community. Moreover, envision being able to conduct interventions in the community, while measuring their effect — by both automatic sensor tools as well as qualitative assessment of the individual subjects. Now, think about doing this for an entire year, while the members of the community go about their everyday lives.
(Aharony et al. 2011)
You may find inspiration in science, technology, or attempting to solve a business need, or your inspiration may be driven by personal interest. Whatever your reason for choosing to initiate a project, you still need to communicate the ideas to multiple stakeholders and plan your project to ensure success.
Starting with a clearly stated and referenceable vision is a good way to ensure that you can communicate the project’s intention to multiple stakeholders, and get them excited and focused on the topic being studied. When setting the vision, keep the potential stakeholders and decision makers in mind. The vision is usually championed by an individual, but may form part of formal structures such as thesis proposals or workshops with multiple potential stakeholders. In terms of your documentation, abstracts, introductions, and overviews are typically completed last, as these items are frequently revisited and updated throughout the course of the project.
The processes in academia, and specifically postgraduate studies, are set up to ensure that prospective students research their topic of interest, and clearly define their intended area of research. The research methodologies are carefully validated to ensure that all relevant aspects are addressed. Business applications tend to be less rigorous regarding processes, while having an increased focus on achieving monetary results. Work through the details of the referenced paper by Aharony et al. (2011), and think carefully about where each of the described sections may be applicable to your potential use cases.
Review Section 2 of the referenced paper: Related work and context.
Conducting your own research will enable you to benefit from incorporating the latest research in your analytics efforts. At this stage, many of your questions may have already been addressed to some extent, allowing you to build on the work completed by others in order to accelerate your efforts. However, other questions may not be answered directly.
You can use project methodologies that are applicable to your industry, or ones that you are familiar with, but it is highly advisable to keep a few topics in mind.
In your research, you will likely encounter subject material relevant to your project, which you can use as input. This may include subject matter or business value frameworks, project plans, typical questions, and plans to answer these questions.
Focusing on the value perspective from an early point helps to keep you focused. This involves considering the following questions:
Accelerators may include using programming libraries – such as Pandas for data analysis, or Bandicoot for the interrogation of mobile phone metadata – but you will also be able to leverage the content from studies and service providers in order to accelerate your efforts.
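As a minimal sketch of what such an accelerator can save you, the example below uses Pandas to summarize a small call log in a few lines. The column names and records are invented for illustration and are not taken from the Friends and Family data set.

```python
import pandas as pd

# Hypothetical call records; the columns and values are invented for illustration.
calls = pd.DataFrame(
    {
        "user": ["a", "a", "b", "b", "b", "c"],
        "contact": ["b", "c", "a", "a", "c", "a"],
        "duration_s": [60, 120, 30, 90, 45, 300],
    }
)

# One row per user: number of calls, distinct contacts, and total talk time.
summary = calls.groupby("user").agg(
    n_calls=("contact", "size"),
    n_contacts=("contact", "nunique"),
    total_duration_s=("duration_s", "sum"),
)
print(summary)
```

Writing the same summary by hand would require explicit loops and bookkeeping; a specialized library such as Bandicoot aims to provide comparable shortcuts for mobile phone metadata.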
New technological options, and applying the best tools or processes ("fit for purpose"), may achieve results in dramatically reduced timeframes or at a fraction of the cost typically associated with large-scale studies. However, you should start building a view of the resources (available and required) at an early stage.
Lastly, you should decide on a strategy to visualize and communicate your project.
Review Section 3 of the referenced paper. In this section, there is a clear example of how to frame your intentions in a manner that can be easily communicated to stakeholders.
Your markdown answer here.
Exercise complete:
This is a good time to "Save and Checkpoint".
The paper describes the methodology employed in detail. When embarking on your journey, there will be additional items to consider which typically do not make it into the final research paper or the final product. Some of these items (which are outside of the scope of the course) are included here for your consideration.
In the video content, earlier in the course, Arek Stopczynski referred to data collection as being expensive. Carefully review the data sources already available, as well as potential new sources of data. You will need to balance the need for trusted, high-accuracy, or fine-grained data with the costs and overheads associated with obtaining the data. In many cases, your pilot project will focus on easier-to-obtain data, and you may need to use proxies or subsets of data to demonstrate the usefulness of the concept being studied. Many of your questions can potentially be answered using a small subset of the data (an application of the Pareto principle, under which a small share of the inputs typically accounts for most of the results). There are typically a number of use cases for the same set of data. Consider social network data: while many organizations, or units within organizations, are excited about the prospect of using this rich source of data, few are able to obtain value from it.
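One rough way to check whether a small subset of your data carries most of the volume is to measure how concentrated it is. The sketch below is illustrative only: the function name `top_share`, the 20% cut-off, and the record counts are assumptions, and a high concentration does not by itself guarantee that the subset answers your questions.

```python
def top_share(values, fraction=0.2):
    """Return the share of the total contributed by the largest `fraction` of values."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values, reverse=True)
    # Always keep at least one item so that small samples still give an answer.
    k = max(1, round(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical number of records contributed by each of ten data sources.
records_per_source = [500, 300, 50, 40, 30, 25, 20, 15, 10, 10]
share = top_share(records_per_source)
print(f"Top 20% of sources hold {share:.0%} of the records")
```

If the share is high, a pilot that starts with only the largest sources may be a reasonable first step before investing in the remaining ones.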
Data preparation can be a tedious process, and typically requires more time than is budgeted for. Accelerators aid in transformation, and data governance can help you to reduce the ongoing effort required. When thinking about the creation of data products or implementing production systems, you will typically want to automate these steps in order to free up the capacity of your data scientists, so that they can spend their time on useful tasks rather than repetitive ones.
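A small step toward this kind of automation is to capture cleaning rules in a reusable function instead of repeating them by hand for every new extract. The sketch below assumes a hypothetical record layout (`user`, `date`, `steps`) and a simple policy of dropping rows that cannot be parsed; your own rules will differ.

```python
from datetime import datetime


def prepare_records(raw_rows):
    """Clean raw rows: normalize user names, parse dates and counts, drop bad rows."""
    cleaned = []
    for row in raw_rows:
        user = row.get("user", "").strip().lower()
        try:
            when = datetime.strptime(row.get("date", "").strip(), "%Y-%m-%d").date()
            steps = int(row.get("steps", ""))
        except ValueError:
            continue  # skip rows with missing or malformed dates or counts
        if not user:
            continue  # skip rows that cannot be attributed to a user
        cleaned.append({"user": user, "date": when, "steps": steps})
    return cleaned


# Hypothetical raw input with typical defects.
raw = [
    {"user": " Alice ", "date": "2011-03-01", "steps": "4200"},
    {"user": "bob", "date": "not a date", "steps": "3100"},
    {"user": "", "date": "2011-03-02", "steps": "5000"},
    {"user": "Carol", "date": "2011-03-02", "steps": "n/a"},
]
clean = prepare_records(raw)
print(clean)
```

Once the rules live in one function, the same preparation can be rerun on every new batch of data and reviewed or tested like any other code.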
In addressing analytics concerns, you would build up a "toolbox" as you practise and implement your analytics capabilities. These may include:
Section 4.2 of the referenced paper contains a number of data sources such as mobile phone call and sensor records, surveys, purchasing behavior, and Facebook data. Identify the major groups into which these data can be categorized, and describe what these types of records would typically tell you about the behavior of the individuals.
Hint: Refer to the Module 1 and 2 content in which the difference between how you want to be perceived and what you commit to is described.
Your markdown answer here.
Exercise complete:
This is a good time to "Save and Checkpoint".
Review: Section 5 of the referenced paper.
The focus of the provided paper is on the mobile phone platform. It defines the sensors used, and describes the data formats, data movement protocols, and the data storage structure.
The system architecture is a conceptual model that defines and describes the technical components contained in the system structure, as well as the system’s behavior. It is typically used for planning and communication purposes when the components need to be described to interested parties.
Usually, you would want to create a system that delivers fast and scalable results, and the components selected will vary greatly, based on available existing resources as well as new requirements and options. Often, you would start with a conceptual or logical model that describes the functions and flow of information in the system. At a later stage of the process, you would revisit granular details such as the physical implementation and specific components that were used.
Many people get stuck when they try to select the technology first, and only extend and revisit the logical and conceptual models at later stages. This approach carries the risk of diluting the project’s purpose, and getting lost in technicalities rather than focusing on the defined purpose.
Privacy considerations will be addressed in the second notebook of this module. However, at this stage, it is important to note that traditional methods that are applied to anonymize data in relational or file-based systems are usually no longer adequate. This necessitates approaches such as “privacy by design” to ensure that you deal with sensitive and potentially sensitive data appropriately. Consider data collected for internal and external use. Your internal researchers may require access to granular and unmasked data, while other business users or applications may only need to access anonymized data sets.
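The idea of serving different views of the same data to different audiences can be sketched as follows. This is an illustration under stated assumptions, not a recommended privacy mechanism: the field names, the key, and the helper functions are hypothetical, and a keyed hash only pseudonymizes an identifier. As noted above, such traditional masking is usually not adequate on its own.

```python
import hashlib
import hmac

# Hypothetical key for illustration; in practice, keys should come from managed secret storage.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(identifier, key=SECRET_KEY):
    """Replace an identifier with a keyed hash (pseudonymization, not anonymization)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def external_view(record):
    """Return a reduced copy of a record for users who do not need granular data."""
    return {
        "user": pseudonymize(record["phone_number"]),
        "call_count": record["call_count"],
        # The raw phone number and the location are deliberately left out.
    }


# Hypothetical granular record, as an internal researcher might see it.
internal_record = {"phone_number": "+15550100", "call_count": 12, "location": "Building A"}
shared_record = external_view(internal_record)
print(shared_record)
```

Deciding up front which fields each audience receives, rather than masking data after the fact, is in the spirit of “privacy by design”.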
While the technical components of systems architecture fall outside of the scope of this course, the following section briefly outlines a number of items that you should carefully consider when defining your architecture.
Consider the following points regarding dealing with data before defining your target system architecture:
Provide a short description of a recent technological component or trend that you think is significant. This should be selected based on your personal experience or interests, as many of the topics on which technological components or trends are based fall outside of the scope of this course.
Hint: APIs, mobile applications, cloud-based computing, and interactive computing are examples of topics that are not covered in detail in this course. Should you wish to do so, you can also base your answer on technological concepts introduced in this course.
Your markdown answer here.
Exercise complete:
This is a good time to "Save and Checkpoint".
When carrying out the phases of a project plan, it is helpful to make use of a checklist to ensure that each step in the plan is executed and each aspect is accounted for. The following sections outline typical phases of a project plan, and highlight some of the items that should be added to your checklists.
Research the topic under review as well as related topics that can be used to add context. Once this is done, you will need to define your project framework, by carrying out the following steps:
Pilot projects typically focus on prioritized use cases. In light of this, it is important to do the following:
Data analysis typically follows the same steps as listed in previous modules with short iterations.
Create a roadmap or an implementation plan. This can be done by doing the following:
Project termination generally requires dependencies to be removed, and data to be dealt with appropriately. Ensure that you are aware of the legal requirements when disposing of or archiving sensitive data. Architecture documents (describing both systems and interactions between different systems) can be of significant value during this phase.
Note: In the second part of this notebook, which you will review in Module 7, you will revisit the technical sections (Sections 6 to 11), as well as the conclusion of the referenced paper.
Project management methodologies and best practices can ensure that you get started quickly. They can, however, also introduce significant overhead. These "best practices" can be used to guide you in unknown areas, and help you deal with the complexities of running large projects.
List two problems that you expect to experience when attempting to scale small-scale or ad hoc analyses to large-scale implementations, or when incorporating these analyses within large organizations. Propose simple corrective actions for each.
Note: These concepts are not dealt with in detail in this course. The aim is to look for insights based on the tools and technologies introduced in this course, rather than attempting to test your knowledge of the various frameworks that exist.
Your markdown answer here.
Exercise complete:
This is a good time to "Save and Checkpoint".
Aharony, Nadav, Wei Pan, Cory Ip, Inas Khayal, and Alex Pentland. 2011. “Social fMRI: Investigating and shaping social mechanisms in the real world.” Pervasive and Mobile Computing 7:643-659.
SAS Institute. 2012. “Analytics Infrastructure: 15 Considerations.” Accessed September 10. http://blogs.sas.com/content/datamanagement/2012/05/09/analytics-infrastructure-15-considerations/.