2019. 11. 27. 17:12ㆍPython programming
(1) pd.concat( [ df1, df2 ] , ignore_index=True )
(2) pd.merge( left = , right= , on = 'column_name', how = 'left' )
- Inner: only includes elements that appear in both dataframes with a common key [default]
- Outer: includes all data from both dataframes
- Left: includes all of the rows from the "left" dataframe along with any rows from the "right" dataframe with a common key; the result retains all columns from both of the original dataframes
- Right: includes all of the rows from the "right" dataframe along with any rows from the "left" dataframe with a common key; the result retains all columns from both of the original dataframes
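
To see the four join types side by side, here is a small sketch. The two dataframes (left_df, right_df) and their columns are made up for illustration, not taken from the course data.

```python
import pandas as pd

# Hypothetical toy data: 'key' is the shared column
left_df = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right_df = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

inner = pd.merge(left=left_df, right=right_df, on="key", how="inner")
outer = pd.merge(left=left_df, right=right_df, on="key", how="outer")
left_join = pd.merge(left=left_df, right=right_df, on="key", how="left")
right_join = pd.merge(left=left_df, right=right_df, on="key", how="right")

print(inner["key"].tolist())        # keys present in both: b, c
print(sorted(outer["key"]))         # every key from either side: a, b, c, d
print(left_join)                    # all left rows; 'y' is NaN for key 'a'
print(right_join)                   # all right rows; 'x' is NaN for key 'd'
```

Rows without a partner on the other side get NaN in the columns that come from that side.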

(3) pd.concat() vs pd.merge()
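
A rough way to picture the difference: pd.concat() stacks dataframes on top of each other, while pd.merge() lines rows up by a key column. The sketch below uses made-up dataframes (df1, df2, df3) just to show the shapes that come out.

```python
import pandas as pd

# Hypothetical toy data
df1 = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
df2 = pd.DataFrame({"key": ["b", "c"], "x": [3, 4]})
df3 = pd.DataFrame({"key": ["b", "c"], "y": [30, 40]})

# concat: rows of df2 are appended below df1; ignore_index=True renumbers 0..3
stacked = pd.concat([df1, df2], ignore_index=True)

# merge: rows are matched on 'key'; columns from both sides end up side by side
joined = pd.merge(left=df1, right=df3, on="key", how="left")

print(stacked.shape)   # 4 rows, 2 columns
print(joined.shape)    # 2 rows, 3 columns
```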


====================================================
Working With Strings In Pandas
====================================================
(4) DataFrame.rename( )
To rename column A to B:
col_rename = {'A':'B'}
df = df.rename( col_rename, axis=1 )
(5) str.split() method
- Purpose: splits a string into a list of words (whitespace is the default separator)
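
A quick sketch of split on a plain Python string and on a pandas Series (through the .str accessor). The sample strings are made up for illustration.

```python
import pandas as pd

# Plain Python string: split on whitespace
text = "Integrated household survey"
words = text.split()
print(words)  # ['Integrated', 'household', 'survey']

# Same idea on a whole Series via the vectorized .str accessor
s = pd.Series(["Integrated household survey", "Labor force survey", None])
split_s = s.str.split()          # each row becomes a list of words
first_word = split_s.str.get(0)  # pick the first word of each row

print(first_word)  # the missing row stays missing instead of raising an error
```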
(6) Besides split, pandas has many more vectorized string methods
- When working with text data, prefer the built-in vectorized methods over Series.apply() for performance
- Advantages of using vectorized string methods
1. Better performance
2. Code that is easier to read and write
3. Automatically excludes missing values
- Reference: working with text data
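
A minimal sketch of advantage 3 (missing values are skipped automatically), using a made-up Series. The lambda passed to apply is only there for comparison.

```python
import pandas as pd

# Hypothetical sample with one missing value
s = pd.Series(["Yes", "no", None])

# Vectorized method: the missing value is passed through untouched
upper_vec = s.str.upper()
print(upper_vec)

# Series.apply() with a plain Python method: the missing value has no .upper()
try:
    s.apply(lambda v: v.upper())
    apply_failed = False
except AttributeError:
    apply_failed = True

print(apply_failed)  # True
```

With apply we would have to add our own check for missing values, which is the extra code the vectorized version saves.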

(7) Regular expressions (regex) for finding patterns inside strings
https://docs.python.org/3.4/library/re.html
(8) Series.str.contains() method
- Checks whether a specific phrase appears in a Series; the result is expressed as True, False, or a missing value
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
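
A small sketch of str.contains() with a regex, on made-up strings. The Series is created with dtype=object on purpose (an assumption on my side) so that the missing value shows up as NaN in the result, as described above.

```python
import pandas as pd

# Hypothetical sample notes; dtype=object keeps the classic NaN behaviour
notes = pd.Series(["National accounts, 2012", "No data", None], dtype=object)

pattern = r"[1-2][0-9]{3}"  # a four-digit year starting with 1 or 2

mask = notes.str.contains(pattern)
print(mask)  # True, False, NaN

# na=False treats missing values as "no match", handy for boolean indexing
mask_filled = notes.str.contains(pattern, na=False)
print(notes[mask_filled])
```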
(9) Series.str.extract() method

Example)
pattern = r"([1-2][0-9]{3})"
years = merged['SpecialNotes'].str.extract(pattern, expand=True)
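
The merged dataframe above comes from the course data, so here is a self-contained sketch with a made-up Series (notes) that shows what str.extract() returns.

```python
import pandas as pd

# Hypothetical sample notes
notes = pd.Series([
    "Reporting period for national accounts data: 2012.",
    "No year here",
    "Census 1998 and 2008",
])

pattern = r"([1-2][0-9]{3})"  # the parentheses mark the part to extract

# expand=True returns a DataFrame with one column per capture group
years = notes.str.extract(pattern, expand=True)
print(years)
```

Only the first match per row is returned (1998 for the last row), and rows without a match give NaN.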
(10) Series.str.extractall()
- Comparison: the Series.str.extract() method only extracts the first match of the pattern.
- If we want to extract all of the matches, we can use the Series.str.extractall() method.
Example)
pattern = r"(?P<Years>[1-2][0-9]{3})"
# Sample data
# Integrated household survey (IHS), 2012
# Integrated household survey (IHS), 2010/11
years = merged['IESurvey'].str.extractall(pattern)
value_counts= years['Years'].value_counts()
value_counts
2012 33
2010 28
2011 22
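
A self-contained sketch of the same idea with a made-up Series (surveys) in place of merged['IESurvey']. The third string is invented to show a row with two matches.

```python
import pandas as pd

# Hypothetical sample data in the same style as the survey names above
surveys = pd.Series([
    "Integrated household survey (IHS), 2012",
    "Integrated household survey (IHS), 2010/11",
    "Census 2008 and 2018",
])

# (?P<Years>...) is a named group, so the result column is called 'Years'
pattern = r"(?P<Years>[1-2][0-9]{3})"

years = surveys.str.extractall(pattern)
print(years)  # one row per match, indexed by (original row, match number)

value_counts = years["Years"].value_counts()
print(value_counts)
```

Note that "2010/11" only gives 2010 here, because "11" is not four digits.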
Example 2)

pattern = r"(?P<First_Year>[1-2][0-9]{3})/?(?P<Second_Year>[0-9]{2})?"
# A question mark, ?, after the slash and after the second group marks each of them as optional,
# so both yyyy and yyyy/yy are matched.
years = merged['IESurvey'].str.extractall(pattern)
first_two_year = years['First_Year'].str[0:2] # take the leading two digits, e.g. '20'
years['Second_Year'] = first_two_year + years['Second_Year'] # e.g. '20' + '11' gives '2011'
- Additional notes
1. The part of the regex to extract must be wrapped in parentheses. If part of the regex isn't grouped using parentheses, (), it won't be extracted.
2. When we add a string to a column using the plus sign, +, pandas will add that string to every value in the column. Note that the strings will be added together without any spaces.
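
Example 2 as a runnable sketch, again with a made-up Series (surveys) standing in for merged['IESurvey'].

```python
import pandas as pd

# Hypothetical sample data
surveys = pd.Series([
    "Integrated household survey (IHS), 2012",
    "Integrated household survey (IHS), 2010/11",
])

# Two named groups; the slash and the second group are both optional
pattern = r"(?P<First_Year>[1-2][0-9]{3})/?(?P<Second_Year>[0-9]{2})?"
years = surveys.str.extractall(pattern)

# Take the century digits from the first year and glue them onto the suffix
first_two_year = years["First_Year"].str[0:2]
years["Second_Year"] = first_two_year + years["Second_Year"]

print(years)
```

For the "2012" row there is no second year, so Second_Year stays NaN; for "2010/11" it becomes "2011", joined without any space.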