[Spark] 스파크 Dataframe 데이터프레임 가공하기

DataProcessing/Spark 2021. 1. 31. 23:57

모듈 import

import findspark
findspark.init()

from pyspark import SparkContext
from pyspark.sql import SQLContext

## Cassandra
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = \
    '--packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.1' \
    ' --conf spark.cassandra.connection.host=localhost:port pyspark-shell'
    

sc = SparkContext(appName="app name")
sqlContext = SQLContext(sc)

CassandraDB 데이터 조회 후 Dataframe생성(filter by column name)

dataframe = sqlContext.read\
        .format("org.apache.spark.sql.cassandra")\
        .options(table=table, keyspace=keyspace)\
        .load() \
        .filter((col("topic")=="covid-19") & (col("datehour")=="-1280455"))

Dataframe column 삭제

# remove 'matchin_rules' column
dataframe = dataframe.drop('matching_rules')

Dataframe 특정 column 조회

dataframe = dataframe.select("topic", "message", "author_id", "place_id", "source", "referenced_tweets", "includes")

Dataframe summary 통계치 구하기

# summary statistics
tweets.describe().show()  # dataframe 출력

특정 column으로 row 개수 count하기

# count user_id
tweets_author = tweets.select("author_id")
user_id = tweets_author.distinct().count()  # int값 반환

# count region
tweets_place = tweets.groupBy("place_id").count()
tweets_place.show()

# count source
tweets_source = tweets.groupBy("source").count()
tweets_source.show()

- 이상 오늘의 삽질일기 끝!

여기저기 삽질도 해보고

날려도 먹으면서

배우는 게

결국 남는거다

- Z.Sabziller

저작자표시

'DataProcessing > Spark' 카테고리의 다른 글

[Spark] 데이터 가공(Feat. 코로나 Trend분석) (0)	2021.02.17
[환경설정] Spark 설치 및 ubuntu 환경 설정 (feat.AWS) (0)	2021.02.15
[Spark] Trend 분석 연관어 빈도수 구하기 (feat. 불용어 처리) (0)	2021.02.06
[Spark] Tutorial #1 데이터 조회, 가공 & 데이터프레임 생성 (0)	2021.01.24
[환경설정] spark 스파크 jupyter notebook 실행 설정 (0)	2021.01.17

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

인기포스트

ABOUT ME

쫄보의삽질 블로그 쫄보의삽질 블로그

'DataProcessing > Spark' 카테고리의 다른 글

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역

인기포스트

ABOUT ME

'DataProcessing > Spark' 카테고리의 다른 글

관련글 관련글 더보기

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역