Posted by a site user. Posted: 2022-04-23 08:00
3 answers in total
懂视网. Time: 2022-04-30 09:15
Stanford is running an entire AWS VPC devoted to analytics, which hosts:
Our data VPC also has a peering connection to our prod VPC, so that the EMR cluster machines can get access to our production RDS read-replica, needed for some of the analytics tasks.
Note that none of this is necessary. Everything will work fine as long as you can set up a cluster, the app machines, and the databases, and they can all connect to each other as needed.
Tracking logs, in recent releases of edx-platform, are typically located on the app server at /edx/var/log/tracking/tracking.log-+%Y%m%d-%s. At Stanford (and edX), the tracking logs from all our app servers get synced to a single bucket in S3 (Stanford uses rsync). Whether the logs are pushed by the app servers or periodically synced by some other process, make sure there are no duplicate or missing tracking log files in this bucket, as that will affect the statistical calculations.
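One way to check the bucket for duplicate or missing files is sketched below. It is only an illustration under several assumptions that are not in the original: that object keys look like `<server>/tracking.log-YYYYMMDD-<epoch>`, that each server produces at least one rotated file per day, and that you already have a listing of (key, checksum) pairs from S3. The function name `audit_tracking_logs` is hypothetical.

```python
import re
from collections import defaultdict
from datetime import date, datetime, timedelta

# Matches rotated file names such as tracking.log-20220403-1648944100.
LOG_NAME = re.compile(r"tracking\.log-(\d{8})-(\d+)$")


def audit_tracking_logs(objects, start, end):
    """Report duplicate and missing tracking logs.

    objects: iterable of (key, checksum) pairs from a bucket listing.
    start, end: inclusive date range that should be covered.
    Assumes (hypothetically) that the first path component of each key
    names the app server the log came from.
    """
    keys_by_checksum = defaultdict(list)
    dates_by_server = defaultdict(set)
    for key, checksum in objects:
        match = LOG_NAME.search(key)
        if match is None:
            continue  # not a rotated tracking log
        server = key.split("/", 1)[0]
        keys_by_checksum[checksum].append(key)
        log_date = datetime.strptime(match.group(1), "%Y%m%d").date()
        dates_by_server[server].add(log_date)

    # Identical content stored under more than one key counts as a duplicate.
    duplicates = sorted(
        sorted(keys) for keys in keys_by_checksum.values() if len(keys) > 1
    )

    expected = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    missing = {}
    for server, dates in dates_by_server.items():
        gaps = [day for day in expected if day not in dates]
        if gaps:
            missing[server] = gaps
    return duplicates, missing
```

A server that uploaded nothing at all in the range will not appear in the result, so compare the reported servers against your own list of app servers as well.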
Stanford keeps a long-running cluster around (1 m3.medium master node and 1 m3.medium core node) and sizes the number of task instances up or down with each task run. The article on creating an EMR cluster has more details.
Note that this is somewhat different from edx.org, which provisions a new EMR cluster with every task run, using a custom ansible module driven by a shell script. Consult the edx-analytics-configuration repo if you are interested in this workflow.
It's pretty much standard RDS, but make sure your RDS security groups for the reports database (written to by the code in edx-analytics-pipeline and read by the code in edx-analytics-data-api) allow access by all the master and slave cluster machines (there are Security Groups associated with EMR-Master and EMR-Slave that were created for us when we launched an EMR cluster) and by all the data API servers. The data API and dashboard (edx-analytics-dashboard) django apps also need databases to function, and we just use the same DB server for these 3 databases.
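A quick way to confirm that the security groups are set up correctly is to test TCP reachability of the database from each machine that needs it (an EMR master or slave machine, a data API server). The sketch below uses only the Python standard library. The host name in the usage note is a placeholder, and 3306 is the default MySQL port, which your RDS instance may or may not use.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, running `can_connect("reports-db.example.internal", 3306)` on a cluster machine should return True once the security group allows that machine in. A successful connection only shows that the port is reachable, not that the MySQL credentials are valid.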
The reports db is filled periodically by the luigi tasks, so a scheduler is needed. We set up a Jenkins box because it provides a nice interface that allows us to schedule jobs periodically (and to view the console output) but also to run them on demand. We did a vanilla sudo apt-get install jenkins on an Ubuntu server. However, edx-analytics-pipeline needs to be checked out and installed on this Jenkins box, because the executable python script remote-task supplied by the install is what kicks off the luigi tasks on the EMR cluster.
Task parameters can be supplied in 3 ways: on the command line of the remote-task command; via an overrides.cfg file that lives on the file system of the Jenkins scheduler box and is pointed to by a command-line parameter to remote-task (this is what Stanford does currently); or in an override.cfg kept in another repo, with the repo location supplied by yet another command-line parameter to remote-task.
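To illustrate what an overrides file does, here is a minimal sketch using Python's configparser, since luigi configuration files are INI-style. The section name, option names, and S3 paths are hypothetical examples, and the assumption that override values simply replace the defaults is mine, not something stated above; check the pipeline's own default config for the real names.

```python
import configparser

# Hypothetical defaults, standing in for the config shipped with the pipeline.
DEFAULTS = """
[database-export]
database = reports
credentials = s3://example-bucket/credentials.json
"""

# Hypothetical overrides file kept on the scheduler box.
OVERRIDES = """
[database-export]
credentials = s3://my-analytics-bucket/secure/reports-credentials.json
"""


def effective_config(*sources):
    """Layer INI texts in order; values from later sources win."""
    parser = configparser.ConfigParser()
    for text in sources:
        parser.read_string(text)
    return parser


config = effective_config(DEFAULTS, OVERRIDES)
for option, value in config["database-export"].items():
    print(f"{option} = {value}")
```

Options that the overrides file does not mention keep their default values, so the file only needs to list what differs for your installation.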
Sundry things are mainly kept in S3, like mysql credentials files for the reports database or .jar libraries needed by various tasks.
Once you're able to launch tasks, have them run to completion, and confirm there's data in your reports MySQL DB, you need to set up the data-api application servers to serve that data from the reports MySQL DB over a REST API. There are ansible roles available in the edx configuration repo (https://github.com/edx/configuration/tree/master/playbooks/roles/analytics-api) for this, and even a playbook that runs this role (ours is at https://github.com/Stanford-Online/configuration/blob/master/playbooks/edx-west/data-api.yml), so you don't need to do much except edit the vars files used by the playbook.
The data API app has a self-documenting front page (https:///docs/) that you can use to test that the data is being served correctly.
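If you prefer to test from a script rather than the docs page, a request can be built along these lines. This is a sketch only: the base URL is a placeholder, the endpoint path is a hypothetical example, and the `Authorization: Token ...` header is an assumption about how your data API instance authenticates clients, so check the docs page of your own deployment for the actual paths and auth scheme.

```python
import json
import urllib.request


def build_api_request(base_url, path, token):
    """Build a GET request for the data API without sending it."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    request = urllib.request.Request(url, method="GET")
    # Assumed auth scheme; adjust to whatever your deployment expects.
    request.add_header("Authorization", "Token " + token)
    request.add_header("Accept", "application/json")
    return request


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body of the response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call might look like `fetch_json(build_api_request("https://data-api.example.com", "/api/v0/status/", "my-token"))`, where every argument is a placeholder to replace with values from your own setup.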
Once you confirm that the data API is serving up data over REST, you can set up the insights (dashboard) app, which is responsible for the UX / presentation of the analytics data. This app does not directly interact with the reports database; rather, it makes REST calls to the data API and interprets and displays the JSON returned.
There are ansible roles available in the edx configuration repo (https://github.com/edx/configuration/tree/master/playbooks/roles/analytics-insights) for this, and even a playbook that runs this role (ours is at https://github.com/Stanford-Online/configuration/blob/master/playbooks/edx-west/data-insights.yml), so you don't need to do much except edit the vars files used by the playbook.
The insights app relies on the edx-platform instance for its authentication / authorization to create a more integrated user experience. In particular, when a user visits the insights app, the app uses the OpenID Connect protocol to seamlessly create an insights account that's linked with the user's edx-platform account. The user's course staff privileges are also propagated from edx-platform to insights, so that users only see analytics data for courses in which they have staff privileges.
This means that some configuration is required in edx-platform to add insights as an OpenID Connect client, and that configuration needs to be in sync with the configuration in the insights app. See article for details.
This is Stanford University's database analytics setup.
Helpful user. Time: 2022-04-30 06:23
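Stanford University's computer science program ranks among the top 3 in the United States. The Stanford Computer Science Department was founded in 1965 and holds a leading position in the US, and indeed worldwide, in computer theory, hardware, software, databases, artificial intelligence, and other fields. Among globally known IT companies, SUN, founded by four Stanford alumni, takes its name from the initials of Stanford University Network, and Yahoo founder Jerry Yang (杨致远) also studied at Stanford. In a sense, without Stanford's support, Silicon Valley, the paradise of the American computer industry, might not have become the Silicon Valley of today.

Stanford's environment: Stanford University is located in Palo Alto, California, close to San Francisco, which is only about an hour away by car. San Francisco has a large Chinese community and a strong Chinese atmosphere, which makes it well suited to Chinese students who go there to study.

Features and requirements of Stanford's curriculum design and admissions: Stanford is among the very best in every CS research direction, and its research areas are broad, covering the fields that are currently popular. Examples include Gaze-enhanced User Interface Design, PwdHash, Tri, and Simulation & Analysis of Muscle Actuated 3D Face Models. This is mainly due to the university's strong funding and donations from its many alumni.

In CS research, Stanford is a top-tier performer in theory, databases, software, hardware, AI, and other fields. Stanford's RISC technology later became the core technology of the SGI/MIPS Rx000 series microprocessors; the DASH and FLASH projects are at the forefront of multiprocessor parallel computer research; and the SUIF parallelizing compiler became a nationally funded key project. A mention of the SUIF compiler in international academic papers even seems to lend some luster to mediocre papers.

The research directions of Stanford's computer science program include: Algorithms; Artificial Intelligence; Bio Computation; Database & Information Systems; Distributed Systems/Ubiquitous Computing; Geometric Computation; Graphics; Hardware/Architecture; Human Computer Interaction; Internet Systems & Infrastructure; Knowledge Representation & Reasoning; Machine Learning; Math Theory of Computation; Natural Language & Speech; Networks; Probabilistic Methods & Game Theoretic Methods; Programming Languages & Compilers; Robotics, Vision & Physical Modeling; Scientific Computing; Security and Privacy; Software/Operating Systems; Systems Reliability/Dependability. Students can get to know these research directions first. Specialization generally happens at the graduate level, and the choice depends on your own interests.

In addition, the stated admission requirements for Stanford's computer science program are a GPA of 3.0 or above, a TOEFL score of 600 (iBT 81-100), and GRE scores.

Experience and tips on applying to Stanford's computer science program: Stanford generally expects students to have excellent language scores, excellent undergraduate grades, strong all-round ability, and research ability. For students applying for scholarships, leadership is also important. Stanford admits very few students from China each year, so applicants must be able to demonstrate elite potential. The main points are as follows:

The computer science program has no requirement on your undergraduate major; that is, students from any major can apply for the Master's and PhD programs in computer science, but they must have a certain level of quantitative analysis ability.

If you have obtained an MS degree at another school, you cannot apply for an MS at Stanford again; but if you have obtained an MSCS degree at another school, you can apply for Stanford's computer science PhD.

You cannot apply to the computer science program twice in the same academic year; for details on reapplying, see the university's application web page.

Compared with other science and engineering fields, CS is clearly not an easy field in which to get a scholarship. In fields such as biology, physics, and chemistry, scholarships are easier to get and full scholarships are more common.

In terms of application difficulty, software engineering, data mining, and distributed computing are currently popular specialties and admit more students, while artificial intelligence, computer theory, and algorithm analysis are more foundational research directions, so they attract far fewer applicants and offer a better chance of a scholarship.

Students planning to apply to Stanford's computer science program can use the information above to prepare and plan their application to US graduate schools in advance.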
Helpful user. Time: 2022-04-30 07:41
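The Stanford Computer Science Department was founded in 1965 and to this day holds a leading position in the US, and indeed worldwide, in databases, artificial intelligence, and other fields. Among globally known IT companies, most people have heard of SUN, which was founded by four Stanford alumni; the company's name is the initials of "Stanford University Network". Yahoo founder Jerry Yang (杨致远) also studied at Stanford. In a sense, without Stanford's support, Silicon Valley, the paradise of the American computer industry, might not have become the Silicon Valley of today.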