Understanding User Behavior Through Log Data and Analysis (Version of Apr 23, 2013). Susan Dumais (Microsoft), Robin Jeffries (Google), Daniel M. Russell (Google), Diane Tang (Google), Jaime Teevan (Microsoft)
Source: https://ils.unc.edu/courses/2016_fall/inls509_002/papers/Dumais14.pdf
DEFINITIONS
1) request: A user action or user request for information. For web applications, a new web page (or change in the current page) is requested via a set of parameters (this might be a query, a submit button at the end of a form, etc.). The result of the request is typically the unit of analysis that the researcher is interested in (e.g., a query, an email message displayed, a program being debugged).
2) cookies: A way of identifying a specific session in a browser. We typically equate a cookie with a user, but this is an important source of bias in log studies as described in more detail in Section 4.
3) diversion: How traffic is selected to be in a particular experimental condition. It might be random at the request level; it might be by user id, by cookie, or as a hybrid approach, by cookie-day (all requests from a given cookie are either in or out of the experiment each day). Typically experiments for user experience changes are user-id or cookie based.
4) triggering: Even if a request or cookie is in the experiment, the experimental change may not occur for all requests. For example, if the change is to show the current weather at the user’s location on weather-related queries, this will only trigger on the small fraction of queries that are weather related; all conditions in the experiment will provide identical experiences for other queries. On the other hand, for an experiment that changes the size of the logo shown in the upper left of all pages, all requests diverted into noncontrol conditions will trigger the change.
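To make the diversion idea concrete, here is a minimal sketch of hash-based cookie and cookie-day diversion. The hash function, the 1000-bucket layout, and the 5% allocation are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of cookie- and cookie-day-based diversion using a hash.
# Bucket count and experiment allocation are hypothetical.
import hashlib

NUM_BUCKETS = 1000  # assumed granularity for traffic allocation

def bucket(key: str) -> int:
    """Deterministically map an identifier to a bucket in [0, NUM_BUCKETS)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def divert(cookie: str, date: str = None, experiment_buckets=range(0, 50)) -> bool:
    """Return True if this cookie (or cookie-day) falls into the experiment.

    With date=None, all of a cookie's requests share one assignment;
    with a date string, the assignment is re-drawn each day (cookie-day diversion).
    """
    key = cookie if date is None else f"{cookie}:{date}"
    return bucket(key) in experiment_buckets

# Example: 5% of cookies (buckets 0-49 out of 1000) see the experimental change.
print(divert("cookie-abc123"))                # cookie diversion
print(divert("cookie-abc123", "2013-04-23"))  # cookie-day diversion
```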
<Analysis Through Log Data>
To estimate the number of observations needed to detect differences that are “interesting”, the analyst needs to determine:
- The metric(s) of interest,
- For each metric, the minimum effect size change that the experiment should be able to detect statistically, e.g., a 2% change in click-through rate,
- For each metric, the standard error.
Each of these is explained further below.
Deciding what effect size matters can be a challenge until an analyst or group has carried out enough experiments to know what level of change is practically important. But a larger problem is how to estimate the standard error, especially for metrics that are a ratio of two quantities (e.g., CTR (click-through-rate) - the number of clicks/number of queries). The most common problem arises when the unit of analysis is different than the experimental unit. For example, for CTR, the unit of analysis is a query, but for cookie-based experiments (as most user experience experiments will be), the experimental unit is a cookie (a sequence of queries) and we cannot assume that queries from the same cookie are independent. There are a variety of ways to calculate the standard error when the observations are not independent – two common ones are the delta method [Wikipedia: Delta method] and using “uniformity trials” [Tang et al., 2010].
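As an illustration of the delta method applied to a ratio metric such as CTR, the following is a minimal sketch using synthetic, cookie-level data; it is not the paper's implementation, and the generated click and query counts are invented.

```python
# A minimal sketch of a delta-method standard error for a ratio metric such as
# CTR (total clicks / total queries), aggregated at the cookie level so the
# per-cookie correlation between clicks and queries is respected.
import numpy as np

def delta_method_se(clicks_per_cookie, queries_per_cookie):
    x = np.asarray(clicks_per_cookie, dtype=float)   # numerator per cookie
    y = np.asarray(queries_per_cookie, dtype=float)  # denominator per cookie
    n = len(x)
    ratio = x.sum() / y.sum()                        # CTR point estimate
    mean_y = y.mean()
    var_x = x.var(ddof=1)
    var_y = y.var(ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    # First-order Taylor (delta method) approximation of Var(x_bar / y_bar).
    var_ratio = (var_x - 2 * ratio * cov_xy + ratio**2 * var_y) / (n * mean_y**2)
    return ratio, np.sqrt(var_ratio)

# Synthetic data: per-cookie query counts and clicks.
rng = np.random.default_rng(0)
queries = rng.poisson(5, size=10_000) + 1
clicks = rng.binomial(queries, 0.3)
ctr, se = delta_method_se(clicks, queries)
print(f"CTR = {ctr:.4f} +/- {1.96 * se:.4f} (95% CI half-width)")
```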
Sizing is also impacted by the triggering rate, i.e., the fraction of the traffic that the experimental change actually impacts (see definition in Section 3.1). If a particular experimental change impacts only 5% of the traffic, then it will take 20 times as long to see an effect than if the change happens on all the traffic. Table 2 shows the effect that different triggering rates have on the number of observations (queries) needed to see an effect. Column 5 shows that more queries are required for lower triggering rates, and Column 7 shows that if counterfactuals are not logged even more queries are required since experimental differences are diluted by all the observations that are the same across conditions. In practice, it helps to have some historical information to make an educated guess about what the triggered fraction will be in calculating the number of (diverted) observations needed for the given power level.
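A rough sizing sketch, under the usual two-sample normal-approximation power formula for a rate metric, can show how the triggering rate changes the requirements. The 30% baseline CTR, the 2% relative change, 80% power, and the 5% triggering rate are illustrative assumptions, and the calculation ignores the within-cookie correlation discussed above (which would inflate these numbers).

```python
# A rough sizing sketch: two-sample normal-approximation power calculation
# for a rate metric, then scaled by the triggering rate.
from scipy.stats import norm

def queries_per_arm(p_baseline, relative_change, alpha=0.05, power=0.8):
    """Triggered queries needed per condition to detect a relative change in a rate."""
    p1 = p_baseline
    p2 = p_baseline * (1 + relative_change)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * var / (p2 - p1) ** 2

# Assumed baseline CTR of 30% and the 2% relative change mentioned above.
n_triggered = queries_per_arm(p_baseline=0.30, relative_change=0.02)

for trigger_rate in (1.0, 0.05):
    # With counterfactual logging we analyze only triggered queries,
    # so diverted traffic scales as 1 / trigger_rate.
    with_counterfactuals = n_triggered / trigger_rate
    # Without counterfactual logging the effect is diluted across all diverted
    # queries, so the requirement grows roughly as 1 / trigger_rate**2.
    without_counterfactuals = n_triggered / trigger_rate ** 2
    print(f"trigger rate {trigger_rate:>5.0%}: "
          f"{with_counterfactuals:,.0f} vs {without_counterfactuals:,.0f} diverted queries per arm")
```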

3.5 Interpreting Results
1) Sanity Checks: The analyst’s first task is to make sure that the data makes sense.
- Calculate means, standard deviations, and confidence intervals for the metrics of interest (e.g., via a dashboard such as Google Analytics).
- It is particularly important to look at overall traffic in all the conditions, which should be the same with random assignment.
- If there are differences in overall traffic across conditions, be sure to rule out the many artifacts (including bugs in logging) before assuming that an observed difference is a real effect.
- It is also important to break down the data in as many ways as possible – by browser, by country, etc.
It may be that some differences are actually caused by a small subset of the population rather than by the experimental manipulation.
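One simple way to operationalize the traffic check above is a goodness-of-fit test of observed traffic against the intended split; the request counts and the 50/50 split below are invented, and the alerting threshold is a matter of judgment.

```python
# A minimal sanity-check sketch: with random 50/50 diversion, overall traffic
# in control and treatment should be statistically indistinguishable. A
# chi-square goodness-of-fit test against the intended split is one simple way
# to flag imbalances worth investigating (counts here are made up).
from scipy.stats import chisquare

control_requests, treatment_requests = 1_002_341, 997_802
observed = [control_requests, treatment_requests]
total = sum(observed)
expected = [total * 0.5, total * 0.5]          # intended 50/50 diversion

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:
    print(f"Traffic imbalance (p={p_value:.2e}): check logging and diversion before trusting metrics")
else:
    print(f"Traffic split looks consistent with 50/50 (p={p_value:.3f})")
```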
2) Interpreting the metrics
- In log analyses, it is standard practice to use confidence intervals [Huck, 2011] rather than analysis of variance significance testing, because the conditions are not organized into a factorial design with interaction terms, and because a confidence interval gives useful information about the size of the effect and its practical significance that is not as easily visible in a significance table. It is conventional to use a 95% confidence interval in comparing each of the experimental conditions to the control.
- consider many different metrics, e.g., clickthrough rate (on results, ads, whole page), time to first click, time on page, etc.
- How does one decide which to trust? Look for converging evidence;
o are there other metrics that ought to increase/decrease when this one does, and do they move in the appropriate direction?
o Might a logging error account for the effect?
o Is this a difference seen before in other experiments?
o How this is done depends on the domain and on previous experience – for example, in search, clicks per ad-shown and clicks per page are very likely to be correlated, but these click metrics are unlikely to be correlated with conversions (when someone purchased something on a page that they got to from an ad).
- Recognizing changes to be aware of: chief among them are those that lead to Simpson’s Paradox, which arises when ratios that have different denominators are compared (Wikipedia: Simpson’s Paradox). These situations are very common in log experiments, and all experimenters should be aware of them, how to identify them, and how they change the analysis; a small numeric sketch follows at the end of this list.
- the broader interpretation of the results
o Generally, there will be agreement of whether a metric’s “good” direction is an increase (number of clicks) or a decrease (latency, time to click).
o There are often trade-offs to weigh between "good" and "bad" movements, e.g., the classic set of trade-offs among speed, efficiency, usability, and design consistency.
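The numeric sketch below demonstrates Simpson’s Paradox with a CTR-like metric: the invented counts are arranged so that condition A wins within every query segment yet loses overall, because the two conditions have very different query mixes (denominators) across segments.

```python
# A minimal numeric sketch of Simpson's Paradox with a ratio metric (CTR).
# Condition A has the higher CTR within every segment, yet the lower CTR
# overall, because the denominators differ sharply across segments.
segments = {
    # segment: (clicks_A, queries_A, clicks_B, queries_B)
    "low-CTR queries":  (90, 1000,   8,  100),
    "high-CTR queries": (30,  100, 280, 1000),
}

tot_a = tot_qa = tot_b = tot_qb = 0
for name, (ca, qa, cb, qb) in segments.items():
    print(f"{name}: A = {ca/qa:.1%}, B = {cb/qb:.1%}")   # A wins in each segment
    tot_a, tot_qa, tot_b, tot_qb = tot_a + ca, tot_qa + qa, tot_b + cb, tot_qb + qb

print(f"overall:  A = {tot_a/tot_qa:.1%}, B = {tot_b/tot_qb:.1%}")  # B wins overall
```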
3) Practical Significance
- take practical significance into account.
The experiment may show a statistically reliable difference when, say, the number of ‘undo’ commands goes from 0.1% to 0.12%, but that small a number might not have any practical significance (it, of course, depends on the application and what people use undo for). Do not get so blinded by the statistics (which are easy to compute) that the practical importance (which can be harder to determine) gets lost from the discussion.
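To make the interplay between statistical and practical significance concrete, here is a minimal sketch that computes a normal-approximation 95% confidence interval for the difference between two conditions; the request counts are invented and chosen to echo the 0.1% vs. 0.12% ‘undo’ example above.

```python
# A hedged sketch tying confidence intervals to practical significance: with
# enough traffic, a tiny shift in a rate is statistically reliable, yet it may
# not matter in practice. Normal-approximation 95% CI for a difference in
# proportions; counts are invented.
import math

def diff_ci(x1, n1, x2, n2, z=1.96):
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p2 - p1
    return d, (d - z * se, d + z * se)

# 5 million requests per condition: 'undo' rate 0.10% vs 0.12% (made-up numbers).
d, (lo, hi) = diff_ci(x1=5_000, n1=5_000_000, x2=6_000, n2=5_000_000)
print(f"difference = {d:.4%}, 95% CI = [{lo:.4%}, {hi:.4%}]")
# The interval excludes zero (statistically reliable), but a 0.02-point change
# in an undo rate may still be practically negligible for the application.
```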
<Log Data Collection>
4.1 Data Collection
- e.g., the search results users click on, with a timestamp for each of these actions.
- An ideal log would additionally allow the experimenter to reconstruct exactly what the user saw at the moment of behavior.
1) the time an event happened
2) the user session it was part of (often via a cookie),
3) the experiment condition, if any,
4) the event type and any associated parameters (e.g., on login page, user selected “create new account”).
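A minimal sketch of what a log record carrying these fields might look like is shown below; the field names, the JSON-lines format, and the example values are assumptions for illustration, not a prescribed schema.

```python
# A minimal sketch of a log record carrying the four fields above (plus the
# request's point of origin, discussed later). Field names are illustrative.
import json, time, uuid

def log_event(event_type, cookie_id, experiment_condition=None, **params):
    record = {
        "timestamp_utc": time.time(),                   # when the event happened (UTC)
        "cookie_id": cookie_id,                         # session/user proxy (see caveats below)
        "experiment_condition": experiment_condition,   # None if not in an experiment
        "event_type": event_type,                       # e.g., "login_page"
        "params": params,                               # event-specific parameters
        "event_id": str(uuid.uuid4()),                  # helps detect duplicates later
    }
    print(json.dumps(record))                           # in practice: append to a log sink
    return record

log_event("login_page", cookie_id="cookie-abc123",
          experiment_condition="exp42-treatment",
          action="create_new_account", source="home_page")
```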
- Caveats
1) Beware of differences between the server's time zone and the user's local time zone.
2) The language of the interaction
ㅇ Be careful not to confuse the user’s country with their language, or the language the UI is presented in with the language of the words people type in their interactions or queries. They will often differ, especially if the experiment runs in countries where people speak multiple languages.
3) User identification: UserIDs, HTTP cookies, IP addresses, and temporary IDs
ㅇ UserIDs: subject to a lot of churn
ㅇ Cookies and temporary IDs do not map one-to-one to individuals (several people can use the same browser instance, and the same person may use multiple devices).
4) Page request origin
o In web logs, knowing where a page request, such as a query, came from can be important in understanding unique behavioral patterns. Queries can be generated from many different entry points—from the home page of search engines, from the search results page, from a browser search box or address bar, from an installed toolbar, by clicking on query suggestions or other links, etc.
o Metadata of this kind (e.g., about the point of request origin) may be useful
o Ultimately, all of the data and metadata collected needs to be in service of the overall goals of analysis—this defines what needs to be logged.
5) exogenous factors
o The site might become unexpectedly viral, perhaps being picked up by Slashdot, Reddit, or the New York Times. Virality can cause a huge swing in logged behavior.
<Data Cleansing>
1) What to clean
Missing Events: Sometimes client applications make optimizations that (in effect) drop events that should have been recorded. One example of this is the web browser that uses a locally cached copy of a web page to implement a “go back” action. While three pages might be visited, only two events may be logged because the visit to the cached page is not seen on the server.
Dropped Data: As logs grow in size, they are frequently collected and aggregated by programs that may suffer instabilities. Gaps in logs are commonplace, and while easily spotted with visualization software, logs still need to be checked for completeness.
Misplaced Semantics: For a variety of reasons, logs often encode a series of events with short (and sometimes cryptic) tags and data. Without careful, continual curation, the meaning of a log event or its interpretation can be lost. Even more subtle, small changes in the ways logging occurs can change the semantics of the logged data. (For instance, the first version of logging code might measure time-of-event from the first click; while later versions might measure time-of-event from the time the page finishes rendering—a small change that can have a substantial impact.) Since data logging and interpretation often take place at different times and with different teams of people, keeping semantics aligned is an ongoing challenge.
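As a small illustration of checking completeness for dropped data, the sketch below scans event timestamps for suspiciously long gaps; the gap threshold and timestamps are invented, and a real pipeline would also reconcile event counts against server-side request counts.

```python
# A minimal completeness check for dropped data: scan event timestamps for
# gaps longer than an assumed threshold.
from datetime import datetime, timedelta

def find_gaps(timestamps, max_gap=timedelta(minutes=5)):
    """Return (start, end) pairs where consecutive events are too far apart."""
    ts = sorted(timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > max_gap]

events = [datetime(2013, 4, 23, 10, 0), datetime(2013, 4, 23, 10, 2),
          datetime(2013, 4, 23, 11, 30), datetime(2013, 4, 23, 11, 31)]
for start, end in find_gaps(events):
    print(f"possible logging gap: no events between {start} and {end}")
```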
2) Data transformations
- As data is modified (e.g., removing spurious events, combining duplicates, or eliminating certain kinds of “non-signal” events), the data-cleaner must annotate the metadata for the log file with each transformation performed.
- Ideally, the entire chain of cleaning transformations should be maintained and tightly associated with progressive copies of the log. Not all data transformations can be reversed, but they should be re-creatable from the original data set, given the log of actions taken.
- The metadata should have enough information so that the ‘chain of change’ can be tracked back from the original file to the one that is used in the final analysis.
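One way to keep such a "chain of change" is to append a record of every transformation to the log's metadata; the sketch below assumes a simple JSON metadata layout, which is an illustrative choice rather than anything prescribed by the paper.

```python
# A minimal sketch of recording the 'chain of change' alongside a cleaned log.
# The metadata layout is an assumption; the point is that every transformation
# is annotated so later copies can be traced back to the original file.
import json, datetime

def record_transformation(metadata, description, input_path, output_path):
    metadata.setdefault("transformations", []).append({
        "applied_at": datetime.datetime.utcnow().isoformat() + "Z",
        "description": description,                  # e.g., "dropped duplicate events"
        "input": input_path,
        "output": output_path,
    })
    return metadata

meta = {"original_log": "raw/events-2013-04-23.log"}
meta = record_transformation(meta, "removed events from known monitoring bots",
                             "raw/events-2013-04-23.log", "clean/step1.log")
meta = record_transformation(meta, "collapsed duplicate event_ids",
                             "clean/step1.log", "clean/step2.log")
print(json.dumps(meta, indent=2))
```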
- The potential misinterpretations take many forms, which we illustrate with encoding of missing data and capped data values.
ㅇ A common data cleaning challenge comes from the practice of encoding missing data with a value of ‘0’ (or worse, “zero” or “-1”). Only with knowledge of what is being captured in the log files, along with the analyst’s judgment of the meaning of missing data, can reasonable decisions be made about how to treat such data. Ideally, data logging systems represent missing data as NIL, ø, or some other nonconfusable data value. But if the logger does not and uses a value that is potentially valid as a behavioral data value, the analyst will need to distinguish valid ‘0’s from missing data ‘0’s (for example), and manually replace the missing data with a nonconfusable value.
ㅇ Capped data values, usually expressing some value on a scale that has an arbitrary max (or min) value, can cause both cleaning and analysis problems. Unless the analyst knows that a particular data value being captured varies between integer values 0…9, validating the data as well as making decisions about cleaning are compromised. For example, if the log is capturing data whose value is capped at 9 (because we “know the value can never go higher”), a long string of 9’s suddenly appearing in the log stream can be misleading: the true values may be well above the cap.
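The sketch below (using pandas) illustrates both cases: replacing a sentinel value used for missing data with a non-confusable marker (NaN) and flagging values that sit at a known cap. The column names, the -1 sentinel, and the cap of 9 are illustrative assumptions.

```python
# A minimal sketch of handling missing-data sentinels and capped values.
import numpy as np
import pandas as pd

df = pd.DataFrame({"dwell_time_s": [12.0, 0.0, -1.0, 30.0],
                   "rating": [3, 9, 9, 9]})

# Treat -1 as "missing" for dwell time; 0.0 may be a valid observation, so it
# is kept unless the analyst decides otherwise.
df["dwell_time_s"] = df["dwell_time_s"].replace(-1.0, np.nan)

# Flag (rather than silently keep) values at the cap, since the true value
# may be larger than 9.
CAP = 9
df["rating_at_cap"] = df["rating"] == CAP
print(df)
print(f"fraction of ratings at the cap: {df['rating_at_cap'].mean():.0%}")
```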
3) Outliers